# 📝 **Introduction**

This Python script demonstrates how to use the Pegasus model from Hugging Face's Transformers library to perform abstractive summarization on a given text. It evaluates the quality of the generated summaries using both lexical (ROUGE) and semantic (BERTScore) metrics. The model used, `google/pegasus-cnn_dailymail`, is fine-tuned specifically for summarizing news articles.

---

## 🔧 Installation and Required Libraries

```python
!pip install transformers datasets rouge_score bert_score --quiet
```

* `transformers`: Provides state-of-the-art pre-trained models like Pegasus.
* `datasets`: Useful for working with text datasets.
* `rouge_score`: Used to calculate ROUGE scores for summary evaluation.
* `bert_score`: Provides a semantic similarity score using BERT embeddings.

In [2]:
!pip install transformers datasets rouge_score bert_score --quiet


---

## 🧩 Importing Libraries

```python
import logging
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from rouge_score import rouge_scorer
from bert_score import score
```

* Necessary modules are imported.
* `PegasusForConditionalGeneration`: Loads the Pegasus model for summarization.
* `PegasusTokenizer`: Tokenizes text inputs for the model.


In [3]:
import logging
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from rouge_score import rouge_scorer
from bert_score import score



---

## 🔇 Logging Configuration

```python
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("absl").setLevel(logging.ERROR)
```

* Reduces verbosity by setting logging level to only show errors.
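
The effect of `setLevel(logging.ERROR)` can be seen in isolation with the standard library alone. The sketch below uses a hypothetical logger name, `demo_library`, and an in-memory stream; neither is part of the notebook.

```python
import io
import logging

# Hypothetical logger name, used only for this demonstration.
demo_logger = logging.getLogger("demo_library")
demo_logger.propagate = False  # keep the demo isolated from the root logger

# Capture log records in memory so the result can be inspected.
buffer = io.StringIO()
demo_logger.addHandler(logging.StreamHandler(buffer))

# Same call as in the notebook: only ERROR and above pass through.
demo_logger.setLevel(logging.ERROR)
demo_logger.warning("this warning is dropped")
demo_logger.error("this error is kept")

captured = buffer.getvalue()
print(captured)
```

Only the error message reaches the handler; the warning is filtered out before it is emitted.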


In [4]:
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("absl").setLevel(logging.ERROR)

---

## 🤖 Loading the Model and Tokenizer

```python
model_name = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)
```

* The Pegasus model fine-tuned on the CNN/DailyMail dataset is loaded.
* Tokenizer converts input text into tokens.
* The model uses these tokens to generate a summary.

In [5]:
model_name = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)


---

## ✂️ Summarization Function

```python
def summarize(text, max_length=60):
```

* `text`: Input text to summarize.
* `max_length`: Maximum length of the generated summary, in tokens (default 60).

```python
    inputs = tokenizer(
        text,
        truncation=True,
        padding='max_length',
        max_length=1024,
        return_tensors="pt"
    )
```

* The text is tokenized.
* Texts longer than 1024 tokens are truncated.
* Shorter texts are padded up to the full 1024 tokens.
* The result is returned as PyTorch tensors (`return_tensors="pt"`).
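
To make truncation and padding concrete, here is a minimal pure-Python sketch that operates on plain lists of token ids. The function name `pad_or_truncate` and the padding id `0` are assumptions for illustration. The real tokenizer also handles special tokens such as the end-of-sequence marker, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # padding: fill the remainder on the right
    mask += [0] * n_pad                  # 0 marks a padding position
    return ids, mask


short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([21, 22, 23, 24, 25, 26, 27], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The Hugging Face tokenizer returns a comparable `attention_mask` next to `input_ids`. Passing that mask to `model.generate` along with the ids is one way to make sure padding positions are ignored explicitly.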

```python
    summary_ids = model.generate(
        inputs['input_ids'],
        max_length=max_length,
        num_beams=4,
        early_stopping=True
    )
```

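* The model generates a summary using beam search.
* `num_beams=4`: Keeps the four most probable partial summaries at each step instead of only the single best one.
* `early_stopping=True`: Ends the search as soon as `num_beams` complete candidate summaries have been found.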
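
The idea behind beam search and the early-stopping rule can be sketched on a toy problem. Everything here is hypothetical: `TOY_MODEL` is a hand-written next-token table, not Pegasus, and the loop omits details of the real implementation such as length penalties. It is a simplified illustration, not the library's algorithm.

```python
import math

# Hypothetical next-token distribution: last token -> {next token: probability}.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"moon": 0.45, "program": 0.55},
    "a": {"moon": 0.9, "program": 0.1},
    "moon": {"</s>": 1.0},
    "program": {"</s>": 1.0},
}


def beam_search(num_beams=2, max_length=4):
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for logp, tokens in beams:
            for nxt, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((logp + math.log(p), tokens + [nxt]))
        # Keep only the num_beams most probable expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:num_beams]:
            (finished if cand[1][-1] == "</s>" else beams).append(cand)
        # Early stopping: enough complete hypotheses, or nothing left to expand.
        if len(finished) >= num_beams or not beams:
            break
    best = max(finished or beams, key=lambda c: c[0])
    return best[1], math.exp(best[0])


greedy_tokens, greedy_prob = beam_search(num_beams=1)
beam_tokens, beam_prob = beam_search(num_beams=2)
print(greedy_tokens, round(greedy_prob, 2))
print(beam_tokens, round(beam_prob, 2))
```

With a single beam the search commits to the locally best first token and ends up with a less probable sequence. With two beams it keeps the alternative alive and finds the more probable one.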

```python
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    summary = summary.replace("<n>", " ").strip()
    return summary
```

* The generated summary is decoded back into text.
* Special tokens are removed, and Pegasus's `<n>` newline marker is replaced with a space.



In [6]:
def summarize(text, max_length=60):
    inputs = tokenizer(
        text,
        truncation=True,
        padding='max_length',
        max_length=1024,
        return_tensors="pt"
    )
    summary_ids = model.generate(
        inputs['input_ids'],
        max_length=max_length,
        num_beams=4,
        early_stopping=True
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    summary = summary.replace("<n>", " ").strip()
    return summary

---

## 📏 ROUGE Evaluation Function

```python
def evaluate_rouge(reference, summary):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    return scorer.score(reference, summary)
```

* `reference`: The text to score against. This notebook passes the original text rather than a human-written reference summary.
* `summary`: The generated summary.
* `rouge1`: Unigram (word-level) overlap.
* `rouge2`: Bigram overlap.
* `rougeL`: Longest common subsequence.
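
What ROUGE-N measures can be sketched in a few lines of plain Python. This is a simplified illustration, not the `rouge_score` implementation: it splits on whitespace, lowercases, and skips stemming. The function name `rouge_n` and the two example sentences are made up for the demonstration.

```python
from collections import Counter


def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: lowercased whitespace tokens, no stemming."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & cand).values())              # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)  # share of candidate n-grams matched
    recall = overlap / max(sum(ref.values()), 1)      # share of reference n-grams matched
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "the apollo program landed the first humans on the moon"
candidate = "apollo landed humans on the moon"
p1, r1, f1_1 = rouge_n(reference, candidate, n=1)
p2, r2, f1_2 = rouge_n(reference, candidate, n=2)
print(f"ROUGE-1: P={p1:.2f} R={r1:.2f} F1={f1_1:.2f}")
print(f"ROUGE-2: P={p2:.2f} R={r2:.2f} F1={f1_2:.2f}")
```

A short candidate scored against a longer reference tends to get high precision and lower recall, which is the same pattern the notebook's scores show.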

In [7]:
def evaluate_rouge(reference, summary):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    return scorer.score(reference, summary)

---

## 🤖 BERTScore Evaluation Function

```python
def evaluate_bertscore(reference, summary):
    P, R, F1 = score([summary], [reference], lang='en', verbose=False)
    return {"precision": P[0].item(), "recall": R[0].item(), "f1": F1[0].item()}
```

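* `score(...)`: Computes semantic similarity between the summary (the candidate, passed first) and the reference (passed second) from contextual token embeddings.
* Outputs: Precision, Recall, and F1 Score, each returned as a tensor with one value per candidate.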
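
The matching step behind BERTScore can be sketched without a model. Each token is paired with its most similar token on the other side by cosine similarity, and the similarities are averaged. The 2-D vectors below are hypothetical stand-ins for contextual embeddings, `greedy_match_scores` is an illustrative name, and optional features such as IDF weighting are left out.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical embeddings; a real run would take these from the language model.
ref_vecs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cand_vecs = [(1.0, 0.0), (0.0, 1.0)]

bs_p, bs_r, bs_f1 = greedy_match_scores(cand_vecs, ref_vecs)
print(f"P={bs_p:.4f} R={bs_r:.4f} F1={bs_f1:.4f}")
```

Because the reference contains a vector with no exact counterpart in the candidate, recall falls below precision, mirroring the gap between the two values in the notebook's output.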

In [8]:
def evaluate_bertscore(reference, summary):
    P, R, F1 = score([summary], [reference], lang='en', verbose=False)
    return {"precision": P[0].item(), "recall": R[0].item(), "f1": F1[0].item()}

---

## 🔍 Main Execution Block

```python
if __name__ == "__main__":
```

* Ensures the code only runs when the script is executed directly.

### Original Text

```python
original_text = """ ... """
```

* A sample passage describing the Apollo space program.

### Generate Summary

```python
summary = summarize(original_text)
```

* The input text is summarized using Pegasus.

### Compute ROUGE Score

```python
rouge_scores = evaluate_rouge(original_text, summary)
```

### Compute BERTScore

```python
bert_scores = evaluate_bertscore(original_text, summary)
```

### Display Evaluation Scores

```python
print("\n📊 ROUGE Scores:")
...
print("\n🤖 BERTScore:")
...
```

* Prints out ROUGE and BERTScore metrics to evaluate summary quality.

In [9]:
if __name__ == "__main__":
    
    original_text = """
    The Apollo program was the third United States human spaceflight program carried out by NASA,
    which accomplished landing the first humans on the Moon from 1969 to 1972.
    First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury,
    which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of "landing a man on the Moon and returning him safely to the Earth" by the end of the 1960s,
    which he proposed in a May 25, 1961, address to Congress.
    """

    print("📄 Original Text:\n", original_text.strip())

     
    summary = summarize(original_text)
    print("\n✂️ Generated Summary:\n", summary)

    
    rouge_scores = evaluate_rouge(original_text, summary)
    bert_scores = evaluate_bertscore(original_text, summary)

    
    print("\n📊 ROUGE Scores:")
    for k, v in rouge_scores.items():
        print(f"{k.upper()}: Precision={v.precision:.4f}, Recall={v.recall:.4f}, F1={v.fmeasure:.4f}")

    
    print("\n🤖 BERTScore:")
    print(f"Precision: {bert_scores['precision']:.4f}")
    print(f"Recall:    {bert_scores['recall']:.4f}")
    print(f"F1 Score:  {bert_scores['f1']:.4f}")


📄 Original Text:
 The Apollo program was the third United States human spaceflight program carried out by NASA,
    which accomplished landing the first humans on the Moon from 1969 to 1972.
    First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury,
    which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of "landing a man on the Moon and returning him safely to the Earth" by the end of the 1960s,
    which he proposed in a May 25, 1961, address to Congress.

✂️ Generated Summary:
 The Apollo program was the third U.S. human spaceflight program carried out by NASA . It landed the first humans on the Moon from 1969 to 1972 . It was later dedicated to President John F. Kennedy's national goal of "landing a man on the Moon




📊 ROUGE Scores:
ROUGE1: Precision=0.9348, Recall=0.4388, F1=0.5972
ROUGE2: Precision=0.8444, Recall=0.3918, F1=0.5352
ROUGEL: Precision=0.9130, Recall=0.4286, F1=0.5833

🤖 BERTScore:
Precision: 0.9537
Recall:    0.8692
F1 Score:  0.9095


---

## Assessing the Quality of Summarization with ROUGE and BERTScore

## 📄 **Original Text**

The original passage discusses the **Apollo program**, focusing on its historical context and purpose. It highlights NASA's achievement of landing humans on the Moon, as well as the national goals shaped during the Eisenhower and Kennedy administrations.

---

## ✂️ **Generated Summary**

> **"The Apollo program was the third U.S. human spaceflight program carried out by NASA. It landed the first humans on the Moon from 1969 to 1972. It was later dedicated to President John F. Kennedy's national goal of 'landing a man on the Moon'"**

### 🔍 Content Evaluation:

* **Strengths**:

  * Preserves the main ideas: the program’s identity, timeline, NASA’s involvement, and Kennedy’s role are clearly conveyed.
  * The summary is short, informative, and to the point.

* **Weaknesses**:

  * The final sentence is cut off mid-quote, most likely because generation stopped at the `max_length=60` token limit.
  * No mention of Eisenhower's initial role in conceptualizing the program.
  * Lacks historical context, such as Kennedy’s 1961 speech to Congress.

---

## 📊 **Interpretation of ROUGE Scores**

| Metric  | Precision | Recall | F1 Score |
| ------- | --------- | ------ | -------- |
| ROUGE-1 | 0.9348    | 0.4388 | 0.5972   |
| ROUGE-2 | 0.8444    | 0.3918 | 0.5352   |
| ROUGE-L | 0.9130    | 0.4286 | 0.5833   |

### 📌 Explanation:

* **Precision**: Measures how much of the generated summary’s content overlaps with the reference.

  * High: 0.93 (ROUGE-1) → Most of the words in the summary appear in the original text.
* **Recall**: Measures how much of the reference’s content is captured in the summary.

  * Relatively low: \~0.43 → Expected here, because the reference is the full original text and the summary is much shorter.
* **F1 Score**: Harmonic mean of Precision and Recall.

  * Moderate: \~0.54 to \~0.60 → The low recall pulls the score down despite the high precision.
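
As a quick arithmetic check, the F1 values in the table can be recomputed from the reported precision and recall. The inputs are already rounded to four decimals, so agreement is only expected to about three decimals. The token counts in the last comment are an inference from the rounded scores, not something the notebook reports.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "rouge1": (0.9348, 0.4388, 0.5972),
    "rouge2": (0.8444, 0.3918, 0.5352),
    "rougeL": (0.9130, 0.4286, 0.5833),
}

for name, (p, r, expected) in reported.items():
    print(f"{name}: recomputed={f1(p, r):.4f} reported={expected:.4f}")

# Assumption for illustration: 43 matching unigrams, 46 summary tokens and
# 98 source tokens would reproduce the rounded ROUGE-1 precision and recall.
inferred_precision = 43 / 46
inferred_recall = 43 / 98
```

If counts of that kind hold, the low recall follows directly from the source being roughly twice as long as the summary.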

---

## 🤖 **BERTScore Evaluation**

| Metric    | Value  |
| --------- | ------ |
| Precision | 0.9537 |
| Recall    | 0.8692 |
| F1 Score  | 0.9095 |

### 📌 Explanation:

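* BERTScore assesses how semantically similar the generated summary is to the original text.
* **High scores** (F1 ≈ 0.91) → The summary retains most of the intended meaning. Raw BERTScore values, computed here without baseline rescaling, tend to fall in a narrow high range, so they are best read relative to other summaries rather than as an absolute grade.
* This suggests that Pegasus captures the essence of the text, even if some details are missing.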

---

## ✅ **Overall Evaluation**

* The summary is **informative** and **semantically accurate**, but lacks a few important details and ends with an incomplete sentence.
* **ROUGE scores** reveal that while the surface-level word overlap is solid, the coverage of content is somewhat limited.
* **BERTScore results** confirm that the summary aligns well with the original meaning.


---

## 🎯 Purpose

This script has two primary goals:

1. To summarize long-form text using the Pegasus transformer model.
2. To evaluate the quality of the generated summary using both lexical (ROUGE) and semantic (BERTScore) metrics.

---

## ✅ **Conclusion**

This code offers a practical way to summarize text with a pre-trained transformer model and to evaluate the result with both lexical and semantic metrics. ROUGE provides insight into surface-level overlap, while BERTScore measures deeper semantic similarity. Together they give a fuller picture of summarization quality than either metric alone, although both are computed here against the source text rather than a human-written reference summary.