# Text Summarization

This notebook shows how an encoder-decoder model can be fine-tuned for a sequence-to-sequence task such as summarization.
In the example below, we fine-tune a [facebook/bart-base](https://huggingface.co/facebook/bart-base) model on a news dataset, [news-qa-summarization](https://huggingface.co/datasets/glnmario/news-qa-summarization).
The generated summaries are evaluated with the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) metric.
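
As a rough illustration of what ROUGE measures, the sketch below computes a simplified ROUGE-1 F1 score (unigram overlap) in plain Python. It is a toy under stated assumptions: it lowercases and splits on whitespace, and it skips the stemming that the `evaluate` implementation can apply, so its numbers will not match the library exactly. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a unigram counts at most as often as it occurs in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

Five of the six unigrams match in both directions, so precision, recall, and F1 all come out at 5/6.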

In [6]:
import os
import random
import numpy as np
import evaluate
from datasets import load_dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    set_seed,
)


In [10]:
model_name = "facebook/bart-base"  # change to "facebook/bart-large-cnn" for better quality
max_input_length = 1024
max_target_length = 128
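# Note: the next four values are not referenced later; the
# Seq2SeqTrainingArguments cell hard-codes its own settings.
per_device_train_batch_size = 2
per_device_eval_batch_size = 2
num_epochs = 2
learning_rate = 5e-5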
seed = 42
set_seed(seed)



In [27]:
# Load dataset and inspect schema
raw_dataset = load_dataset("glnmario/news-qa-summarization", split="train")
print(raw_dataset)

Dataset({
    features: ['story', 'questions', 'answers', 'summary'],
    num_rows: 10388
})


In [31]:
type(raw_dataset)

datasets.arrow_dataset.Dataset

In [65]:
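# 1) Hold out 10% of the data as the test set
train_test = raw_dataset.train_test_split(test_size=0.1, seed=42)
# For a quick smoke test, split a small subset instead:
# raw_dataset.select(range(100)).train_test_split(test_size=0.1, seed=42)

# 2) Split the remaining 90% into train and validation (0.1111 * 0.9 is about 0.10)
train_val = train_test["train"].train_test_split(
    test_size=0.1111, seed=42
)

# 3) Rebuild a DatasetDict with 3 splits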
dataset = DatasetDict({
    "train":      train_val["train"],
    "validation": train_val["test"],
    "test":       train_test["test"]
})

In [66]:
len(dataset["train"]), len(dataset["validation"]), len(dataset["test"])

(8310, 1039, 1039)
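
The split sizes above follow from simple arithmetic: holding out 10% for testing leaves 90%, and 0.1111 of that remainder is again roughly 10% of the full dataset. The sketch below reproduces the counts, assuming that a fractional `test_size` is rounded up to a whole number of rows (an assumption about `train_test_split` that matches the sizes printed above). The helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows: int, test_fraction: float) -> tuple[int, int]:
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test


n_total = 10388  # rows in the full dataset, from the schema output
n_rest, n_test = split_sizes(n_total, 0.1)      # first split: hold out the test set
n_train, n_val = split_sizes(n_rest, 0.1111)    # second split: carve out validation
print(n_train, n_val, n_test)
```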

In [67]:
dataset["train"]

Dataset({
    features: ['story', 'questions', 'answers', 'summary'],
    num_rows: 8310
})

In [68]:
# Initialize tokenizer and model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print("Tokenizer vocab size:", tokenizer.vocab_size)
print("Model params (M):", round(model.num_parameters() / 1e6, 2))



Tokenizer vocab size: 50265
Model params (M): 139.42


In [69]:
def preprocess_function(examples):
    # inputs: articles
    inputs = examples["story"]
    # targets: summaries
    targets = examples["summary"]

    # tokenize inputs
    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        truncation=True,
        padding="max_length",  # or "longest" for on-the-fly padding
    )

    # tokenize targets; text_target replaces the deprecated
    # as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=targets,
        max_length=max_target_length,
        truncation=True,
        padding="max_length",
    )

    # Replace padding token id in labels with -100 so they are ignored by loss
    labels_ids = labels["input_ids"]
    labels_ids = [
        [(lid if lid != tokenizer.pad_token_id else -100) for lid in label]
        for label in labels_ids
    ]
    model_inputs["labels"] = labels_ids

    return model_inputs
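
To see why padded label positions are replaced with -100, the sketch below averages a per-token loss while skipping every position labelled -100, which is the default ignore index of PyTorch's cross-entropy loss. The pad id of 1 matches BART's tokenizer, but the token ids and per-token loss values are made up for illustration, and `mask_padding` and `masked_mean_loss` are hypothetical helpers, not part of any library.

```python
IGNORE_INDEX = -100  # label value skipped by the loss
PAD_TOKEN_ID = 1     # BART's pad token id


def mask_padding(label_ids: list[int], pad_id: int = PAD_TOKEN_ID) -> list[int]:
    """Replace pad ids with IGNORE_INDEX, mirroring preprocess_function."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]


def masked_mean_loss(token_losses: list[float], label_ids: list[int]) -> float:
    """Average per-token losses over positions whose label is not ignored."""
    kept = [loss for loss, lab in zip(token_losses, label_ids) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


labels = mask_padding([0, 713, 16, 2, 1, 1])   # made-up ids with two pad positions
losses = [0.5, 1.5, 1.0, 1.0, 9.0, 9.0]        # made-up per-token losses
print(labels)
print(masked_mean_loss(losses, labels))
```

Without the masking step, the two large losses at the padded positions would dominate the average even though they carry no information about the summary.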

In [70]:
feature_list = list(dataset["train"].features)
feature_list

['story', 'questions', 'answers', 'summary']

In [71]:
dataset_tokenized = dataset.map(preprocess_function, batched=True, remove_columns=list(feature_list))




In [72]:
dataset_tokenized

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 8310
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1039
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1039
    })
})

In [73]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding="longest",
)
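
With `padding="longest"`, a collator pads each batch only up to the length of its longest sequence rather than to a fixed maximum. The sketch below shows the idea in plain Python; `pad_batch` is a hypothetical helper, not the collator's actual implementation, and the token ids are made up. Inputs are padded with the pad token id, while labels are padded with -100 so that the padded positions are ignored by the loss.

```python
def pad_batch(sequences: list[list[int]], pad_value: int) -> list[list[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


input_ids = [[0, 100, 200, 2], [0, 300, 2]]   # made-up token ids of unequal length
labels = [[0, 11, 2], [0, 12, 13, 14, 2]]

padded_inputs = pad_batch(input_ids, pad_value=1)   # 1 is BART's pad token id
padded_labels = pad_batch(labels, pad_value=-100)   # -100 is ignored by the loss
print(padded_inputs)
print(padded_labels)
```

Because `preprocess_function` already pads everything to `max_length`, dynamic padding only saves computation if that fixed padding is dropped during preprocessing.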

In [74]:
rouge = evaluate.load("rouge")

def postprocess_text(preds, labels):
    preds = [p.strip() for p in preds]
    labels = [l.strip() for l in labels]
    return preds, labels

# np.where(condition, A, B) works element-wise:
# where condition is True, take the value from A;
# otherwise take the value from B.

def compute_metrics(eval_pred): # eval_pred is a tuple of predictions, labels
    predictions, labels = eval_pred
    # replace -100 back to pad_token_id
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id) 
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)  
    decoded_preds = tokenizer.batch_decode(
        predictions, skip_special_tokens=True
    )
    
    decoded_labels = tokenizer.batch_decode(
        labels, skip_special_tokens=True
    )

    decoded_preds, decoded_labels = postprocess_text(
        decoded_preds, decoded_labels
    )

    result = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True,
    )
    # convert the ROUGE scores (fractions between 0 and 1) to percentages
    result = {k: round(v * 100, 2) for k, v in result.items()}

    # also track average generated length
    prediction_lens = [
        np.count_nonzero(p != tokenizer.pad_token_id) for p in predictions
    ]
    result["gen_len"] = np.mean(prediction_lens)

    return result
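
Two bookkeeping steps in `compute_metrics` can be illustrated without the tokenizer: restoring -100 to the pad id before decoding, and measuring generated length as the number of non-pad tokens. This is a plain-Python sketch with made-up token ids; the list comprehension plays the role of `np.where`, and the helper names are hypothetical.

```python
PAD_TOKEN_ID = 1  # BART's pad token id


def restore_padding(ids: list[list[int]], pad_id: int = PAD_TOKEN_ID) -> list[list[int]]:
    """Swap -100 back to the pad id so the sequences can be decoded."""
    return [[pad_id if tok == -100 else tok for tok in row] for row in ids]


def mean_generated_length(preds: list[list[int]], pad_id: int = PAD_TOKEN_ID) -> float:
    """Average number of non-pad tokens per prediction."""
    lengths = [sum(tok != pad_id for tok in row) for row in preds]
    return sum(lengths) / len(lengths)


label_ids = restore_padding([[0, 42, 2, -100, -100], [0, 7, 8, 9, 2]])
pred_ids = [[2, 0, 42, 2, 1, 1], [2, 0, 7, 8, 2, 1]]   # made-up generated ids
print(label_ids)
print(mean_generated_length(pred_ids))
```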


In [75]:
training_args = Seq2SeqTrainingArguments(
    output_dir="bart-newsqa-sum",
    eval_strategy="epoch",
    eval_steps=500,  # only used when eval_strategy="steps"; ignored here
    logging_steps=100,
    save_steps=500,
    save_total_limit=2,
    num_train_epochs=3,
    per_device_train_batch_size=2,  
    per_device_eval_batch_size=4,
    learning_rate=3e-5,
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    predict_with_generate=True,
    generation_max_length=max_target_length,
    gradient_accumulation_steps=8,   
    fp16=True,                      
    report_to="none",               
)

In [76]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["validation"],
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()




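| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | 2.1912 | 1.929808 | 35.4 | 13.85 | 24.48 | 33.14 | 61.423484 |
| 2 | 1.9688 | 1.895033 | 36.48 | 14.84 | 25.47 | 34.22 | 60.73821 |
| 3 | 1.8586 | 1.880833 | 36.14 | 14.46 | 25.14 | 33.88 | 62.033686 |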




TrainOutput(global_step=1560, training_loss=2.079087428557567, metrics={'train_runtime': 761.0724, 'train_samples_per_second': 32.756, 'train_steps_per_second': 2.05, 'total_flos': 1.52007299039232e+16, 'train_loss': 2.079087428557567, 'epoch': 3.0})
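
The `global_step` and throughput figures in the output above can be sanity-checked by hand. The sketch assumes a single GPU and that a final, partially filled accumulation group still counts as one optimizer step (an assumption that matches the reported 1560 steps).

```python
import math

n_train = 8310          # training examples
per_device_batch = 2    # per_device_train_batch_size
grad_accum = 8          # gradient_accumulation_steps
epochs = 3              # num_train_epochs
runtime_s = 761.0724    # train_runtime from the output above

effective_batch = per_device_batch * grad_accum            # examples per optimizer step
batches_per_epoch = math.ceil(n_train / per_device_batch)  # forward/backward passes
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_steps = steps_per_epoch * epochs
samples_per_second = n_train * epochs / runtime_s

print(effective_batch, steps_per_epoch, total_steps)
print(round(samples_per_second, 3))
```

Gradient accumulation gives an effective batch size of 16 while only two examples are held in GPU memory at a time.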

In [77]:

trainer.save_model("bart-newsqa-sum-final")
tokenizer.save_pretrained("bart-newsqa-sum-final")

('bart-newsqa-sum-final/tokenizer_config.json',
 'bart-newsqa-sum-final/special_tokens_map.json',
 'bart-newsqa-sum-final/vocab.json',
 'bart-newsqa-sum-final/merges.txt',
 'bart-newsqa-sum-final/added_tokens.json',
 'bart-newsqa-sum-final/tokenizer.json')

In [79]:
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="bart-newsqa-sum-final",
    tokenizer="bart-newsqa-sum-final",
    device=0,  # or -1 for CPU
)


Device set to use cuda:0


In [83]:

article_sample = raw_dataset[0]["story"]
print(f"\n\n Article: {article_sample}")
# max_new_tokens caps the summary length; a max_length argument would be
# overridden by the pipeline's default max_new_tokens
summary = summarizer(article_sample, max_new_tokens=128, min_length=20, do_sample=False)[0]["summary_text"]
print(f"\n\n Summary:{summary}")




 Article: 'SINDH KALAY', England (CNN) -- The aroma of freshly baking flatbread wafts through the air as a unit of British soldiers position themselves for a quick patrol around the village of Sindh Kalay. A British soldier on patrol in the mock Afghan village of Sindh Kalay. Market vendors hawk grapes and melons, as a group of village elders sit smoking water pipes and suspicious-looking men lurk beside battered motorcycles. What should the soldiers do? Conduct a weapons search? Approach the village elders first? In the complex political and cultural terrain of Afghanistan, what is the best course of action? Except this is not Afghanistan. It's Norfolk, England. Instead of the Hindu Kush mountains, it is the green ladscape and tidy farmhouses of the English countryside that stretch out behind them. Welcome to the British Army's state-of-the art training ground. It cost more than $20 million to build and every British soldier serving in Afghanistan will do his or her training here. "

In [84]:
# source of the article = https://mediarelations.unibe.ch/media_releases/2025/media_releases_2025/low_blood_sugar_detected_by_speaking_into_a_smartphone/index_eng.html
article_sample = """Low blood sugar detected by speaking into a smartphone. Low blood sugar (hypoglycemia) is a critical diabetes-related condition. Researchers at the Inselspital, Bern University Hospital and the University of Bern have now shown for the first time that the human voice can even reveal early signs of hypoglycemia. Recordings made with the microphone of an ordinary smartphone and analyzed using artificial intelligence could make diabetes management safer and easier in the future.  Low blood sugar, medically known as hypoglycemia, is one of the most common and dangerous acute complications of diabetes. Within minutes, it can lead to dizziness, confusion, loss of consciousness, or even life-threatening situations. Despite modern glucose sensors, it is often difficult to recognize impending hypoglycemia in time. Yet the human voice is recognized to be a sensitive mirror of the body: it changes when we are tired, stressed, or ill; and, as it now turns out, also when blood sugar drops. The voice as a warning signal Researchers at the Inselspital, Bern University Hospital and the University of Bern, together with international partners, have shown for the first time that hypoglycemia can be reliably detected based on characteristic changes in the voice. All that was needed were voice recordings made with the microphone of a commercially available smartphone which were then evaluated using a machine-learning algorithm. In all, 22 people with type 1 diabetes took part in two clinical studies. Under strictly controlled conditions, the participants' blood sugar levels were, on the one hand, adjusted to a normal level and, on the other, deliberately lowered to induce hypoglycemia. Within these two phases, the participants spoke into the microphone of an ordinary smartphone in a quiet room. They read texts aloud, described images, held vowels or repeated syllable sequences in rapid succession. 
This resulted in a total of 540 voice recordings taken at normal or at low blood sugar levels. The researchers then evaluated the audio recordings using a machine-learning algorithm. The AI analyzed subtle differences in the voice, such as pitch, volume, resonance, clarity, and sound dynamics. On this basis, its ability to detect hypoglycemia was very reliable. The AI achieved its best results when the participants read aloud, where it correctly detected hypoglycemia in around 90 percent of cases. When repeating short syllables, the accuracy was around 87 percent. Simple technology with great potential The study proves for the first time that changes in the voice can indicate an acute medical problem. «Our findings clearly show that the voice can provide important clues about a person's state of health,» says Prof. Christoph Stettler, Director and Chief Physician of the Department of Diabetes, Endocrinology, Nutritional Medicine and Metabolism at the Inselspital Bern (UDEM), the study's last author. «Using an ordinary smartphone and artificial intelligence, hypoglycemia can be detected at an early stage without the need for additional devices.» Dr. Vera Lehmann, clinical research physician and the study's first author, also emphasizes the significance of the results: «We were able to show that an ordinary smartphone is sufficient to detect physiological changes that people are themselves sometimes unaware of. This opens up completely new possibilities for ways in which technology can help prevent dangerous situations in the future.» Given the widespread use of smartphones, this approach could improve the detection and prevention of hypoglycemia worldwide, especially in regions where modern glucose sensors are not widely available. However, the researchers emphasize that the method is intended to complement existing technologies, not replace them. 
Bern as a hub for innovative diabetes research The research team at the Inselspital and the University of Bern is one of the world's leading groups in the field of AI-supported diabetes research. In an earlier study, the research group has already shown that behavior while driving a car can indicate low blood sugar levels. With this current study, the researchers have added a new dimension to the spectrum: the voice as a biomarker for acute metabolic imbalance. Next steps toward everyday use In further studies, the researchers now want to test whether voice analysis is also effective in everyday speech situations, such as when using voice assistants like Siri or Alexa. If the approach proves successful, then in the future, simple voice commands could help early detection of dangerously low blood sugar levels, making life safer for people with diabetes"""
print(f"\n\n Article: {article_sample}")
# max_new_tokens caps the summary length; a max_length argument would be
# overridden by the pipeline's default max_new_tokens
summary = summarizer(article_sample, max_new_tokens=128, min_length=20, do_sample=False)[0]["summary_text"]
print(f"\n\n Summary:{summary}")






 Summary:Low blood sugar detected by speaking into a smartphone .
Human voice can even reveal early signs of hypoglycemia .
Researchers at the Inselspital, Bern University Hospital and the University of Bern have now shown for the first time that the human voice can be reliably detected .
Humans are recognized to be a sensitive mirror of the body .


# Todo

1. Choose a different news story to summarize.
2. Compare the result with another encoder-decoder model such as T5. Which performs better? Discuss your findings.