# 4. Text Summarization

**Estimated Time**: ~2 hours

**Prerequisites**: Notebooks 1-3 (understanding of tokenization, pipelines, confidence scores, and context comprehension from QA)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Distinguish** between extractive and abstractive summarization approaches
2. **Understand** encoder-decoder (seq2seq) architecture for text generation
3. **Control** summary length using min/max length parameters
4. **Tune** generation parameters like beam search, sampling, and repetition penalty
5. **Recognize** and mitigate hallucination in generated summaries

## Setup

Run this cell first. If you completed Notebooks 1-3, you already have the core packages ready.

In [None]:
# Core imports
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("Setup complete!")

---

# Part 1: Conceptual Foundation

## What is Text Summarization?

**In plain English**: Text summarization compresses a long document into a shorter version while preserving the most important information.

**Technical definition**: Summarization is a text-to-text generation task where the model takes a long input sequence and produces a shorter output sequence that captures the key content.

### Visual Example

```
INPUT (Long Article):
"Scientists at MIT have developed a new type of battery that can charge 
in under 5 minutes. The breakthrough uses a novel electrode material made 
from carbon nanotubes. Lead researcher Dr. Smith says the technology could 
revolutionize electric vehicles by eliminating range anxiety. The team 
published their findings in Nature Energy yesterday..."

OUTPUT (Summary):
"MIT researchers created a fast-charging battery using carbon nanotube 
electrodes that could transform electric vehicles."
```

### Two Types of Summarization

| Type | How It Works | Output | Pros/Cons |
|------|--------------|--------|----------|
| **Extractive** | Selects and combines existing sentences | Exact sentences from original | More faithful, but can be choppy |
| **Abstractive** | Generates new text that paraphrases content | Newly written summary | More fluent, but may hallucinate |

This notebook focuses on **Abstractive Summarization** - the model generates new text.

```
EXTRACTIVE SUMMARIZATION:
┌─────────────────────────────────────────────┐
│ Original: "The cat sat on the mat. It was   │
│ a sunny day. The cat was happy."            │
└─────────────────┬───────────────────────────┘
                  │ (select important sentences)
                  ▼
    "The cat sat on the mat. The cat was happy."
     └────────────────┬─────────────────────┘
        Exact sentences from original

ABSTRACTIVE SUMMARIZATION (This Notebook):
┌─────────────────────────────────────────────┐
│ Original: "The cat sat on the mat. It was   │
│ a sunny day. The cat was happy."            │
└─────────────────┬───────────────────────────┘
                  │ (understand and rewrite)
                  ▼
    "A happy cat enjoyed sitting on a mat on a sunny day."
     └────────────────────┬─────────────────────────────┘
               Newly generated paraphrase
```
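
To make the contrast concrete, here is a minimal sketch of an extractive baseline. It is an illustrative assumption rather than a method used elsewhere in this notebook: the helper name `extractive_summary` and the frequency-based scoring are hypothetical choices. Each sentence is scored by the average frequency of its words, and the top sentences are returned verbatim in their original order.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Pick the highest-scoring sentences and return them in original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Word frequencies over the whole text (lowercased)
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, keep the top ones, restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

toy_text = "The cat sat on the mat. It was a sunny day. The cat was happy."
print(extractive_summary(toy_text, num_sentences=2))
```

On the toy text this selects the same two sentences as the extractive diagram above. Because every output sentence is copied from the input, this approach cannot introduce new facts, but it also cannot merge or rephrase sentences.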

### How Abstractive Summarization Works: Encoder-Decoder

Unlike the encoder-only models (BERT) from Notebooks 1-3, the summarization models in this notebook use an **encoder-decoder** architecture:

```
                    ENCODER                         DECODER
              ┌─────────────────┐             ┌─────────────────┐
INPUT TEXT    │                 │   CONTEXT   │                 │   OUTPUT
"Scientists   │  Processes and  │  ────────►  │  Generates new  │   "MIT creates
 at MIT..."   │  understands    │   VECTORS   │  text word by   │    fast battery"
              │  the input      │             │  word           │
              └─────────────────┘             └─────────────────┘

The encoder reads the entire input and creates a "meaning" representation.
The decoder uses that meaning to generate the summary one token at a time.
```

Popular encoder-decoder models for summarization:
- **BART** (Facebook): Bidirectional encoder + autoregressive decoder
- **T5** (Google): "Text-to-Text Transfer Transformer"
- **Pegasus** (Google): Pre-trained specifically for summarization

### Connection to Previous Notebooks

| Notebook | Architecture | Task |
|----------|--------------|------|
| 1-3 (MLM, NER, QA) | Encoder-only (BERT) | Understanding/extraction |
| **4 (Summarization)** | **Encoder-Decoder** | **Generation** |

```
ENCODER-ONLY (Previous Notebooks):     ENCODER-DECODER (This Notebook):
┌──────────────────┐                   ┌──────────────────┐   ┌──────────────────┐
│      ENCODER     │                   │      ENCODER     │───│     DECODER      │
│                  │                   │                  │   │                  │
│  Input → Output  │                   │  Input ─────────►│───│──────► Output    │
│  (same length)   │                   │  (understanding) │   │  (generation)    │
└──────────────────┘                   └──────────────────┘   └──────────────────┘
   Tasks:                                 Tasks:
   - Fill masked words                    - Summarization
   - Classify tokens (NER)                - Translation
   - Extract spans (QA)                   - Text generation
```

### Key Generation Concepts

Unlike extraction (finding existing text), generation requires **decoding strategies**:

| Concept | Description |
|---------|-------------|
| **Beam Search** | Tracks several candidate sequences in parallel and keeps the `num_beams` highest-scoring ones at each step |
| **Greedy Decoding** | Always picks the highest-probability next token |
| **Sampling** | Randomly samples from probability distribution |
| **Top-k Sampling** | Samples only from the k most likely tokens |
| **Top-p (Nucleus)** | Samples from smallest set covering p probability mass |
| **Temperature** | Controls randomness (lower = more focused, higher = more random) |
| **Repetition Penalty** | Discourages repeating the same words/phrases |
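
The sampling-related rows in the table can be illustrated on a toy next-token distribution. The sketch below uses only the standard library; the vocabulary, the scores, and the helper names (`softmax`, `top_k_filter`, `top_p_filter`) are hypothetical and chosen for illustration, not taken from any library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

# Hypothetical vocabulary and next-token scores, for illustration only
vocab = ["battery", "rocket", "cat", "the", "banana"]
logits = [2.0, 1.0, 0.5, 0.2, -1.0]

greedy_choice = vocab[logits.index(max(logits))]   # greedy: always the top token
sharp = softmax(logits, temperature=0.5)           # low temperature: more focused
flat = softmax(logits, temperature=2.0)            # high temperature: more random
top2 = top_k_filter(softmax(logits), k=2)          # top-k candidate pool
nucleus = top_p_filter(softmax(logits), p=0.8)     # top-p (nucleus) candidate pool

rng = random.Random(0)
sampled = rng.choices(vocab, weights=nucleus, k=1)[0]  # sample from the nucleus pool

print("greedy:", greedy_choice)
print("P(battery) at T=0.5 vs T=2.0:", round(sharp[0], 3), round(flat[0], 3))
print("top-k pool:", [v for v, q in zip(vocab, top2) if q > 0])
print("top-p pool:", [v for v, q in zip(vocab, nucleus) if q > 0])
print("sampled:", sampled)
```

Lowering the temperature concentrates probability on the top token, while top-k and top-p both shrink the pool of tokens that sampling can pick from.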

### Real-World Applications

Summarization powers many practical applications:

- **News Digests**: Summarize articles for quick reading
- **Meeting Notes**: Condense transcripts into action items
- **Legal/Medical**: Summarize lengthy documents
- **Email Triage**: Preview long emails with summaries
- **Research**: Summarize academic papers
- **Social Media**: Auto-generate post summaries

### Key Terminology

| Term | Definition |
|------|------------|
| **Abstractive** | Generating new text to summarize content |
| **Extractive** | Selecting existing sentences to form summary |
| **Encoder-Decoder** | Architecture with separate understanding and generation components |
| **Seq2Seq** | Sequence-to-sequence: mapping one sequence to another |
| **Beam Search** | Decoding strategy exploring multiple candidates |
| **Hallucination** | Model generates plausible but factually incorrect content |
| **Compression Ratio** | Original length / Summary length |

### Check Your Understanding

Before moving on, try to answer these questions (answers at the end):

1. What type of summarization generates new text rather than selecting existing sentences?
   - A) Extractive summarization
   - B) Abstractive summarization
   - C) Selective summarization

2. What architecture is used for abstractive summarization?
   - A) Encoder-only (like BERT)
   - B) Decoder-only (like GPT)
   - C) Encoder-Decoder (like BART, T5)

3. What is "hallucination" in the context of summarization?
   - A) When the model refuses to generate output
   - B) When the model generates plausible but incorrect information
   - C) When the summary is too long

4. What does beam search do during text generation?
   - A) Searches for keywords in the input
   - B) Explores multiple candidate sequences and keeps the best ones
   - C) Splits the input into beams of text

---

# Part 2: Basic Implementation

## Your First Summarization Pipeline

Let's create a summarization pipeline and compress some text:

In [None]:
# Create a summarization pipeline
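# Naming the model explicitly keeps results reproducible
# (sshleifer/distilbart-cnn-12-6 is also the pipeline's default)
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")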

# Sample article to summarize
article = """
Artificial intelligence researchers at Google DeepMind have achieved a significant 
breakthrough in protein structure prediction. Their AI system, called AlphaFold, 
can accurately predict the 3D structure of proteins from their amino acid sequences. 
This has been one of biology's grand challenges for over 50 years.

The implications are enormous for drug discovery and understanding diseases. 
Previously, determining protein structures required expensive and time-consuming 
experiments like X-ray crystallography. AlphaFold can now predict structures in 
hours rather than years, and with remarkable accuracy.

The team has released predictions for nearly all known proteins, totaling over 
200 million structures. Scientists worldwide are already using this data to 
accelerate research in areas ranging from antibiotic resistance to plastic 
pollution cleanup.
"""

# Generate summary
summary = summarizer(article, max_length=60, min_length=20, do_sample=False)

print("Original article:")
print(article.strip())
print(f"\n{'='*60}\n")
print("Summary:")
print(summary[0]['summary_text'])

### Understanding the Output

The summarization pipeline returns a list of dictionaries, each containing:
- `summary_text`: The generated summary

Let's examine compression ratios:

In [None]:
# Calculate compression statistics
original_words = len(article.split())
summary_words = len(summary[0]['summary_text'].split())
compression_ratio = original_words / summary_words

print("Compression Statistics:")
print("="*40)
print(f"  Original words:   {original_words}")
print(f"  Summary words:    {summary_words}")
print(f"  Compression ratio: {compression_ratio:.1f}x")
print(f"  Reduction:        {(1 - summary_words/original_words)*100:.0f}%")

### Controlling Summary Length

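You can control the output length with the `min_length` and `max_length` parameters. Both are measured in tokens, not words, so the word counts printed below will usually be somewhat lower than the token limits: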

In [None]:
# Different summary lengths
length_configs = [
    {"name": "Very Short", "min_length": 10, "max_length": 30},
    {"name": "Short", "min_length": 20, "max_length": 50},
    {"name": "Medium", "min_length": 40, "max_length": 80},
    {"name": "Long", "min_length": 60, "max_length": 120},
]

print("Summary Length Comparison:")
print("="*70)

for config in length_configs:
    result = summarizer(
        article, 
        min_length=config["min_length"], 
        max_length=config["max_length"],
        do_sample=False
    )
    words = len(result[0]['summary_text'].split())
    print(f"\n[{config['name']}] ({words} words):")
    print(f"  {result[0]['summary_text']}")
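
Because the limits are token counts, it can help to derive them from the length of the input instead of hard-coding them. The following is a minimal sketch under stated assumptions: `length_budget` and `whitespace_count` are hypothetical helpers, and the whitespace tokenizer only stands in for a real one.

```python
def length_budget(text, count_tokens, ratio=0.3, floor=10):
    """Return (min_length, max_length) in tokens for a target compression ratio."""
    n_tokens = count_tokens(text)
    max_length = max(floor, int(n_tokens * ratio))
    min_length = max(1, max_length // 2)  # keep min_length below max_length
    return min_length, max_length

# Stand-in tokenizer for illustration: split on whitespace.
# In the notebook you could count real tokens instead, for example with
#   len(summarizer.tokenizer(text)["input_ids"])
def whitespace_count(text):
    return len(text.split())

demo = "word " * 200
min_len, max_len = length_budget(demo, whitespace_count, ratio=0.25)
print(min_len, max_len)
```

The resulting pair can be passed to the pipeline as `min_length` and `max_length`. The `floor` argument keeps very short inputs from producing a zero-length budget.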

### Processing Different Types of Text

Let's see how summarization performs on various text types:

In [None]:
# Test on different text types
test_texts = {
    "News Article": """
    The Federal Reserve announced today that it will raise interest rates by 
    0.25 percentage points, bringing the benchmark rate to a range of 5.25% 
    to 5.5%. This marks the eleventh rate increase since March 2022 as the 
    central bank continues its fight against inflation. Fed Chair Jerome Powell 
    stated that future rate decisions will depend on incoming economic data. 
    Markets reacted positively to the news, with stocks rising slightly.
    """,
    
    "Scientific Abstract": """
    We present a novel approach to training large language models using 
    reinforcement learning from human feedback (RLHF). Our method involves 
    three stages: supervised fine-tuning on demonstration data, reward model 
    training, and policy optimization using proximal policy optimization (PPO). 
    Experiments on multiple benchmarks show that our approach significantly 
    improves alignment with human preferences while maintaining strong 
    performance on standard NLP tasks.
    """,
    
    "Product Description": """
    Introducing the SmartHome Hub 3.0, the ultimate central control system for 
    your connected home. With support for over 500 smart devices, voice control 
    via Alexa and Google Assistant, and our new AI-powered automation engine, 
    you can create the perfect smart home experience. The hub features a 
    7-inch touchscreen display, built-in speaker, and Thread/Matter compatibility 
    for future-proof connectivity.
    """,
}

print("Summarizing Different Text Types:")
print("="*70)

for text_type, text in test_texts.items():
    result = summarizer(text, max_length=50, min_length=15, do_sample=False)
    print(f"\n[{text_type}]")
    print(f"  Summary: {result[0]['summary_text']}")

---

## Exercise 1: Multi-Length Summaries (Guided)

**Difficulty**: Basic | **Time**: 10-15 minutes

**Your task**: Create summaries of varying lengths for the same article and compare them.

### Step 1: Create a summary generator function

In [None]:
def generate_multi_length_summaries(text, summarizer_pipeline):
    """
    Generate summaries at different compression levels.
    
    Returns:
        dict with 'tweet', 'paragraph', and 'abstract' summaries
    """
    summaries = {}
    
    # Note: min_length and max_length count tokens, not words or characters
    # Tweet-length (very short, up to 40 tokens)
    result = summarizer_pipeline(text, min_length=10, max_length=40, do_sample=False)
    summaries['tweet'] = result[0]['summary_text']
    
    # Paragraph (medium, up to 100 tokens)
    result = summarizer_pipeline(text, min_length=40, max_length=100, do_sample=False)
    summaries['paragraph'] = result[0]['summary_text']
    
    # Abstract (longer, up to 150 tokens)
    result = summarizer_pipeline(text, min_length=80, max_length=150, do_sample=False)
    summaries['abstract'] = result[0]['summary_text']
    
    return summaries

# Test article
climate_article = """
A new study published in Nature Climate Change reveals that global temperatures 
could rise by 2.7 degrees Celsius by 2100 under current policies. Researchers 
from the University of Oxford analyzed data from 195 countries and found that 
existing climate commitments fall short of the Paris Agreement goals.

The study highlights several key findings: ice sheet melting is accelerating 
faster than predicted, sea levels could rise by up to one meter, and extreme 
weather events are becoming more frequent. Lead author Dr. Sarah Chen emphasized 
that immediate action is needed to prevent irreversible damage.

However, the research also identified pathways to limit warming to 1.5 degrees. 
These include rapid decarbonization of the energy sector, widespread adoption of 
electric vehicles, reforestation efforts, and carbon capture technologies. The 
researchers estimate this would require $4 trillion in annual investment.

Several governments have already responded to the findings. The European Union 
announced plans to accelerate its green transition, while China committed to 
reaching carbon neutrality before 2060. Climate activists are calling for even 
more ambitious targets.
"""

summaries = generate_multi_length_summaries(climate_article, summarizer)

print("Multi-Length Summary Comparison:")
print("="*70)

### Step 2: Display and compare summaries

In [None]:
# Display summaries with statistics
original_words = len(climate_article.split())

for level, summary_text in summaries.items():
    words = len(summary_text.split())
    chars = len(summary_text)
    compression = original_words / words
    
    print(f"\n{'='*70}")
    print(f"[{level.upper()}] - {words} words, {chars} chars, {compression:.1f}x compression")
    print(f"{'='*70}")
    print(summary_text)

### Step 3: Try your own article

Paste your own article and generate multi-length summaries:

In [None]:
# YOUR CODE HERE
# Paste your own article
my_article = """
Paste your article here. It should be at least a few paragraphs long
for meaningful summarization. News articles, blog posts, or research
abstracts work well.
"""

# Uncomment to run:
# my_summaries = generate_multi_length_summaries(my_article, summarizer)
# for level, summary in my_summaries.items():
#     print(f"\n[{level.upper()}]")
#     print(summary)

---

# Part 3: Intermediate Exploration

## Generation Parameters Deep Dive

Summarization quality depends heavily on generation parameters. Let's explore them:

In [None]:
# Sample text for experiments
sample_text = """
SpaceX successfully launched its Starship rocket on its third test flight, 
achieving several milestones. The massive rocket reached space for the first 
time, demonstrating its potential for future Mars missions. However, both the 
booster and the spacecraft were lost during the descent phase. Despite this, 
SpaceX called the mission a success, noting the valuable data collected. The 
company plans to continue rapid iteration on the design.
"""

# Beam search vs Greedy decoding
print("Beam Search vs Greedy Decoding:")
print("="*60)

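# min_length is set explicitly in both calls: the checkpoint's default
# min_length can exceed max_length=50, which would force the output to run
# to the length limit and be cut off mid-sentence.

# Greedy (num_beams=1)
greedy_result = summarizer(
    sample_text, 
    max_length=50, 
    min_length=15,
    num_beams=1,  # Greedy
    do_sample=False
)
print("\n[Greedy (num_beams=1)]:")
print(f"  {greedy_result[0]['summary_text']}")

# Beam search (num_beams=4)
beam_result = summarizer(
    sample_text, 
    max_length=50, 
    min_length=15,
    num_beams=4,  # Beam search
    do_sample=False
)
print("\n[Beam Search (num_beams=4)]:")
print(f"  {beam_result[0]['summary_text']}")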

In [None]:
# Sampling with different temperatures
print("\nTemperature Effects (with sampling):")
print("="*60)

temperatures = [0.3, 0.7, 1.0, 1.5]

for temp in temperatures:
    result = summarizer(
        sample_text, 
        max_length=50,
        min_length=15,  # explicit, so the checkpoint default cannot exceed max_length
        do_sample=True,
        temperature=temp,
        top_k=50,
    )
    print(f"\n[Temperature = {temp}]:")
    print(f"  {result[0]['summary_text']}")

### Understanding Generation Parameters

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| `num_beams` | Higher = explores more candidates | 1 (greedy) to 8 |
| `do_sample` | True = random sampling, False = deterministic | True/False |
| `temperature` | Higher = more random, Lower = more focused | 0.1 to 2.0 |
| `top_k` | Limits to k most likely tokens | 10 to 100 |
| `top_p` | Nucleus sampling - uses smallest set with p probability | 0.9 to 0.95 |
| `repetition_penalty` | Discourages repeating tokens | 1.0 to 2.0 |
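
To see why `num_beams` matters, consider a toy example in which greedy decoding commits to the most likely first token and misses a better overall sequence. The bigram table `TOY_MODEL` below is entirely hypothetical, and the search is a simplified sketch: real implementations also apply details such as length normalization, which are omitted here.

```python
import math

# Hypothetical next-token probabilities keyed by the previous token (toy bigram model)
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"rocket": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "rocket": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps=5):
    """Return the best (tokens, log_prob) pair found with the given beam width."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":  # finished hypotheses carry over unchanged
                candidates.append((tokens, score))
                continue
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams[0]

greedy_tokens, greedy_score = beam_search(num_beams=1)  # beam width 1 is greedy decoding
beam_tokens, beam_score = beam_search(num_beams=2)

print("greedy:", " ".join(greedy_tokens[1:-1]), round(math.exp(greedy_score), 2))
print("beam-2:", " ".join(beam_tokens[1:-1]), round(math.exp(beam_score), 2))
```

Greedy decoding picks "the" first because it is the single most likely token, ending with a sequence probability of 0.33. With two beams the search also keeps "a" alive and finds "a rocket" at 0.36.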

In [None]:
# Repetition penalty demonstration
# Some texts can cause repetition issues
repetitive_text = """
The new product is great. The product has many features. The product is 
available in stores. The product costs $99. The product is very popular. 
The product has received positive reviews. The product is manufactured 
in the USA. The product comes with a warranty.
"""

print("Repetition Penalty Effect:")
print("="*60)

# min_length is set explicitly in both calls so that the checkpoint's default
# min_length cannot exceed max_length=50.

# Without repetition penalty
result_no_penalty = summarizer(
    repetitive_text,
    max_length=50,
    min_length=15,
    repetition_penalty=1.0,  # No penalty
    do_sample=False
)
print("\n[No Penalty (1.0)]:")
print(f"  {result_no_penalty[0]['summary_text']}")

# With repetition penalty
result_with_penalty = summarizer(
    repetitive_text,
    max_length=50,
    min_length=15,
    repetition_penalty=1.5,  # Penalty applied
    do_sample=False
)
print(f"\n[With Penalty (1.5)]:")
print(f"  {result_with_penalty[0]['summary_text']}")
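
Conceptually, a repetition penalty lowers the score of tokens that have already been generated before the next token is chosen. The sketch below shows one commonly used rule (divide positive scores by the penalty, multiply negative ones). The function name and the toy scores are assumptions for illustration, and the library's own implementation may differ in detail.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.5):
    """Lower the score of every token id that has already been generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score, or multiplying a negative one, both reduce it
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

# Hypothetical scores for a 4-token vocabulary; ids 0 and 3 were already generated
logits = [3.0, 2.5, 1.0, -0.5]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 3, 0], penalty=1.5)

print("before:", logits)
print("after: ", penalized)
print("top token before/after:", logits.index(max(logits)), penalized.index(max(penalized)))
```

In this toy case the already-used token 0 drops from 3.0 to 2.0, so token 1 becomes the top choice. A penalty of 1.0 leaves all scores unchanged, which is why 1.0 means "no penalty".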

### Comparing Summarization Models

Different models have different strengths. Let's compare a few:

In [None]:
# Load a different model - T5
print("Loading T5 summarization model...")
summarizer_t5 = pipeline("summarization", model="t5-small")
print("Model loaded!\n")

In [None]:
# Compare models on the same text
comparison_text = """
Apple unveiled its new Vision Pro headset at WWDC, marking the company's entry 
into the mixed reality market. The device features a high-resolution display 
system, advanced eye and hand tracking, and runs on a new operating system 
called visionOS. Priced at $3,499, it targets professional and creative users 
rather than the mass market. Analysts are divided on whether it will succeed 
in a market dominated by Meta's more affordable Quest devices.
"""

print("Model Comparison:")
print("="*70)

# DistilBART (default)
result_bart = summarizer(comparison_text, max_length=50, min_length=20, do_sample=False)
print(f"\n[DistilBART (default)]:")
print(f"  {result_bart[0]['summary_text']}")

# T5
result_t5 = summarizer_t5(comparison_text, max_length=50, min_length=20, do_sample=False)
print(f"\n[T5-small]:")
print(f"  {result_t5[0]['summary_text']}")

---

## Exercise 2: Parameter Tuning (Semi-guided)

**Difficulty**: Intermediate | **Time**: 15-20 minutes

**Your task**: Write a function that finds the best generation parameters for a given text.

**Hints**:
1. Try different combinations of parameters
2. Evaluate based on length, readability, and information preservation
3. Consider using metrics like word overlap with original

In [None]:
# YOUR CODE HERE

def evaluate_summary(original, summary):
    """
    Simple evaluation metrics for a summary.
    """
    original_words = set(original.lower().split())
    summary_words = set(summary.lower().split())
    
    # Word overlap (crude measure of information retention)
    overlap = len(original_words & summary_words) / len(original_words)
    
    # Compression ratio
    compression = len(original.split()) / max(len(summary.split()), 1)
    
    # Length check
    word_count = len(summary.split())
    
    return {
        'word_overlap': overlap,
        'compression_ratio': compression,
        'word_count': word_count,
    }


def find_best_parameters(summarizer_pipeline, text, target_length=50):
    """
    Try different parameter combinations and return the best summary.
    """
    parameter_sets = [
        {"num_beams": 1, "do_sample": False, "name": "Greedy"},
        {"num_beams": 4, "do_sample": False, "name": "Beam-4"},
        {"num_beams": 4, "do_sample": False, "length_penalty": 1.5, "name": "Beam-4+Length"},
        {"num_beams": 1, "do_sample": True, "temperature": 0.7, "top_k": 50, "name": "Sampling"},
    ]
    
    results = []
    
    for params in parameter_sets:
        # Separate the display name from the generation arguments without
        # mutating the shared parameter dictionaries
        name = params["name"]
        generation_kwargs = {k: v for k, v in params.items() if k != "name"}
        
        summary = summarizer_pipeline(
            text,
            max_length=target_length,
            min_length=target_length // 2,
            **generation_kwargs
        )[0]['summary_text']
        
        metrics = evaluate_summary(text, summary)
        metrics['name'] = name
        metrics['summary'] = summary
        
        results.append(metrics)
    
    # Sort by word overlap (higher is better, although this crude metric
    # tends to favor longer, more extractive summaries)
    results.sort(key=lambda x: x['word_overlap'], reverse=True)
    
    return results


# Test the parameter finder
test_article = """
Researchers at Stanford University have developed a new AI system that can 
detect early signs of Alzheimer's disease from brain scans with 94% accuracy. 
The deep learning model was trained on thousands of MRI images and can identify 
subtle patterns that are invisible to the human eye. Early detection could 
enable earlier treatment and better patient outcomes. The team is now working 
with hospitals to conduct clinical trials.
"""

results = find_best_parameters(summarizer, test_article)

print("Parameter Comparison Results:")
print("="*70)

for r in results:
    print(f"\n[{r['name']}]")
    print(f"  Word overlap: {r['word_overlap']:.1%}")
    print(f"  Compression:  {r['compression_ratio']:.1f}x")
    print(f"  Words:        {r['word_count']}")
    print(f"  Summary:      {r['summary']}")

---

# Part 4: Advanced Topics

## Understanding Hallucination

A critical issue with abstractive summarization is **hallucination** - when the model generates plausible but factually incorrect information:

In [None]:
# Example where hallucination might occur
# (The model might add details not present in the original)
hallucination_test = """
The company announced record profits for the quarter. Revenue grew significantly 
compared to last year. The CEO expressed optimism about future growth.
"""

# Generate multiple summaries to see variation
print("Potential Hallucination Detection:")
print("="*60)
print(f"Original text: {hallucination_test.strip()}\n")

# Generate with sampling (more likely to hallucinate)
for i in range(3):
    result = summarizer(
        hallucination_test,
        max_length=40,
        min_length=10,  # explicit, so the checkpoint default cannot exceed max_length
        do_sample=True,
        temperature=0.9,
    )
    print(f"Summary {i+1}: {result[0]['summary_text']}")

print("\n⚠️ Watch for: specific numbers, percentages, or names not in the original!")

In [None]:
# Strategies to reduce hallucination

def check_for_hallucinations(original, summary):
    """
    Simple check for potential hallucinations.
    Looks for specific patterns that might indicate made-up content.
    """
    import re
    
    # Named 'flags' so the imported warnings module is not shadowed
    flags = []
    
    # Check for numbers in summary not in original
    summary_numbers = set(re.findall(r'\b\d+(?:\.\d+)?%?\b', summary))
    original_numbers = set(re.findall(r'\b\d+(?:\.\d+)?%?\b', original))
    new_numbers = summary_numbers - original_numbers
    
    if new_numbers:
        flags.append(f"Numbers in summary not in original: {new_numbers}")
    
    # Check for proper nouns (simplified - words starting with capitals)
    # This is a rough heuristic. Punctuation is stripped and whole words are
    # compared, so "Capital." matches "capital" and substrings do not count.
    punctuation = '.,;:!?"\'()'
    original_words = {w.strip(punctuation).lower() for w in original.split()}
    
    for word in summary.split():
        cleaned = word.strip(punctuation)
        if len(cleaned) > 2 and cleaned[0].isupper():
            if cleaned.lower() not in original_words:
                flags.append(f"Capitalized word not in original: '{cleaned}'")
    
    return flags


# Test hallucination detection
original = """
The startup raised funding in its Series A round. Investors included several 
venture capital firms. The company plans to use the funds for expansion.
"""

# Potentially hallucinated summary
fake_summary = "The startup raised $50 million in Series A funding from Sequoia Capital."

print("Hallucination Detection:")
print("="*60)
print(f"Original: {original.strip()}")
print(f"\nSummary: {fake_summary}")

# Use a distinct name so the imported warnings module is not overwritten
hallucination_flags = check_for_hallucinations(original, fake_summary)
if hallucination_flags:
    print("\n⚠️ Potential hallucinations detected:")
    for flag in hallucination_flags:
        print(f"  - {flag}")
else:
    print("\n✓ No obvious hallucinations detected")

### Under the Hood: Encoder-Decoder Architecture

In [None]:
# Load model and tokenizer separately to see internals
model_name = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print(f"Model: {model_name}")
print(f"Model type: {type(model).__name__}")
print(f"\nModel structure:")
print(f"  Encoder layers: {len(model.model.encoder.layers)}")
print(f"  Decoder layers: {len(model.model.decoder.layers)}")
print(f"  Hidden size: {model.config.d_model}")
print(f"  Vocab size: {model.config.vocab_size}")

In [None]:
# Step-by-step generation
text = "Scientists discovered a new species of deep-sea fish. The fish lives at depths exceeding 8,000 meters."

# STEP 1: Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

print("STEP 1 - Tokenization:")
print(f"  Input text: '{text}'")
print(f"  Input tokens: {inputs['input_ids'].shape[1]}")
print(f"  Token IDs: {inputs['input_ids'][0][:10].tolist()}...")

In [None]:
# STEP 2: Encode
with torch.no_grad():
    encoder_outputs = model.model.encoder(**inputs)

print("STEP 2 - Encoding:")
print(f"  Encoder output shape: {encoder_outputs.last_hidden_state.shape}")
print(f"  (batch_size, sequence_length, hidden_size)")

In [None]:
# STEP 3: Generate (decode)
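with torch.no_grad():
    generated_ids = model.generate(
        **inputs,  # passes both input_ids and attention_mask
        max_length=50,
        min_length=10,  # explicit, so the checkpoint default cannot exceed max_length
        num_beams=4,
        early_stopping=True,
        return_dict_in_generate=True,
        output_scores=True,
    )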

# Decode the generated tokens
summary = tokenizer.decode(generated_ids.sequences[0], skip_special_tokens=True)

print("STEP 3 - Generation:")
print(f"  Generated token count: {len(generated_ids.sequences[0])}")
print(f"  Generated summary: '{summary}'")

In [None]:
# Visualize generation step by step
print("\nToken-by-token generation:")
print("="*50)

tokens = tokenizer.convert_ids_to_tokens(generated_ids.sequences[0])
for i, token in enumerate(tokens[:15]):  # First 15 tokens
    if token not in ['<s>', '</s>', '<pad>']:
        print(f"  Step {i:2}: '{token}'")

### Performance Considerations

| Consideration | Recommendation |
|---------------|----------------|
| **Model size** | DistilBART for speed, BART-large for quality |
| **Long inputs** | Use an encoder-decoder model built for long documents (e.g., LED, the Longformer Encoder-Decoder), or split the text into chunks |
| **Hallucination** | Use beam search (not sampling), verify facts |
| **Batch processing** | Process multiple texts together |
| **Length control** | Tune min/max_length for your use case |
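
For the long-input row, one simple workaround is to split the text into overlapping chunks, summarize each chunk, and join the results. The sketch below is a minimal illustration under stated assumptions: `chunk_words`, `summarize_long`, and the stand-in `first_five_words` summarizer are hypothetical helpers, and chunk sizes are counted in words as a rough proxy for tokens.

```python
def chunk_words(text, chunk_size=400, overlap=50):
    """Split text into overlapping windows of words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last window reached the end
            break
    return chunks

def summarize_long(text, summarize_fn, chunk_size=400, overlap=50):
    """Summarize each chunk with summarize_fn, then join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_words(text, chunk_size, overlap))

# Stand-in summarizer for illustration: keep the first five words of each chunk
def first_five_words(chunk):
    return " ".join(chunk.split()[:5])

long_text = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(long_text, chunk_size=400, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])
print(summarize_long(long_text, first_five_words))
```

In the notebook, `summarize_fn` could wrap the pipeline, for example a small function that returns `summarizer(chunk, max_length=60, min_length=20, do_sample=False)[0]['summary_text']`. Since a word often maps to more than one token, a conservative `chunk_size` leaves headroom below the model's input limit.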

### Limitations of Abstractive Summarization

1. **Hallucination**: May generate plausible but incorrect facts
2. **Context limits**: Many models accept only about 512 to 1,024 input tokens, so longer inputs must be truncated or chunked
3. **Domain sensitivity**: Models trained on news may struggle with technical text
4. **Evaluation difficulty**: Hard to automatically measure summary quality
5. **Abstractiveness**: Sometimes just paraphrases rather than truly summarizes
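
As a small illustration of the evaluation problem, the sketch below computes unigram precision, recall, and F1 between a reference text and a candidate summary. It is a simplified score in the spirit of ROUGE-1, not an implementation of any official metric, and `unigram_overlap` is a hypothetical helper name.

```python
import re
from collections import Counter

def unigram_overlap(reference, candidate):
    """Precision, recall and F1 over word counts (a simplified ROUGE-1-style score)."""
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))
    cand_counts = Counter(re.findall(r"\w+", candidate.lower()))
    # Counter intersection clips each word's matches to the smaller count
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "The cat sat on the mat."
copied = unigram_overlap(reference, "The cat sat.")
unrelated = unigram_overlap(reference, "A dog barked loudly.")
print(copied)
print(unrelated)
```

Scores like this reward word reuse, so a fluent paraphrase can score poorly while a summary that copies words but distorts the facts can score well. That gap is one reason automatic evaluation remains hard.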

In [None]:
# Demonstrate domain sensitivity
domain_examples = {
    "News (good)": """
    The president announced new economic policies today at a press conference. 
    The measures include tax cuts for middle-income families and increased 
    infrastructure spending. Market analysts responded positively to the news.
    """,
    
    "Technical (may struggle)": """
    The implementation uses a red-black tree with amortized O(log n) insertion 
    complexity. Memory allocation is handled through a custom slab allocator to 
    minimize fragmentation. The garbage collector uses a generational approach 
    with mark-and-sweep for the old generation.
    """,
    
    "Legal (may struggle)": """
    Pursuant to Section 14(a) of the Securities Exchange Act of 1934, the 
    undersigned hereby certifies that to the best of their knowledge, the 
    proxy statement does not contain any untrue statement of material fact 
    or omit to state a material fact necessary to make the statements therein 
    not misleading.
    """,
}

print("Domain Comparison:")
print("="*70)

for domain, text in domain_examples.items():
    result = summarizer(text, max_length=40, min_length=15, do_sample=False)
    print(f"\n[{domain}]")
    print(f"  {result[0]['summary_text']}")

---

## Exercise 3: Model Comparison (Independent)

**Difficulty**: Advanced | **Time**: 15-20 minutes

**Your task**: Build a class that compares multiple summarization models on the same text.

**Requirements**:
1. Load and compare at least 2 different models
2. Generate summaries with consistent parameters
3. Calculate comparison metrics
4. Identify which model works best for different text types
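
For requirement 3, one possible metric is word overlap between two texts, in the spirit of ROUGE-1. The sketch below is a simplified stand-in, not the official ROUGE implementation: `unigram_f1` is a hypothetical helper that ignores stemming and stopwords. It can compare a model summary against a reference summary, or two model summaries against each other.

```python
import re
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 score over overlapping words (simplified ROUGE-1-style measure)."""
    cand_counts = Counter(re.findall(r"\w+", candidate.lower()))
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))

    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(f"{unigram_f1('the cat', 'the cat sat on'):.3f}")
```

A high score between two models' summaries means they picked similar words; it says nothing about whether either summary is factually correct.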

In [None]:
# YOUR CODE HERE

class SummarizationComparator:
    """
    Compares multiple summarization models on the same text.
    """
    
    def __init__(self):
        """Initialize with multiple models."""
        self.models = {}
        self.load_models()
    
    def load_models(self):
        """Load summarization models."""
        print("Loading models...")
        
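        # DistilBART fine-tuned on CNN/DailyMail (the pipeline's default summarization
        # checkpoint at the time of writing; pinned so results stay reproducible)
        self.models['DistilBART'] = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")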
        print("  Loaded DistilBART")
        
        # T5-small
        self.models['T5-small'] = pipeline("summarization", model="t5-small")
        print("  Loaded T5-small")
        
        print("All models loaded!")
    
    def compare(self, text, max_length=60, min_length=20):
        """
        Compare all models on the same text.
        
        Returns:
            dict with model names as keys and results as values
        """
        results = {}
        original_words = len(text.split())
        
        for name, model in self.models.items():
            summary = model(
                text,
                max_length=max_length,
                min_length=min_length,
                do_sample=False,
            )[0]['summary_text']
            
            summary_words = len(summary.split())
            
            results[name] = {
                'summary': summary,
                'word_count': summary_words,
                # Guard against division by zero if a model returns an empty summary
                'compression': original_words / max(summary_words, 1),
                'char_count': len(summary),
            }
        
        return results
    
    def format_comparison(self, text, results):
        """
        Format comparison results for display.
        """
        lines = []
        lines.append("Summarization Model Comparison")
        lines.append("=" * 70)
        lines.append(f"Original text ({len(text.split())} words):")
        # Collapse newlines and indentation so the preview prints on one line
        lines.append(f"  {' '.join(text.split())[:150]}...")
        lines.append("")
        
        for name, result in results.items():
            lines.append(f"[{name}]")
            lines.append(f"  Words: {result['word_count']} | Compression: {result['compression']:.1f}x")
            lines.append(f"  Summary: {result['summary']}")
            lines.append("")
        
        return '\n'.join(lines)


# Create the comparator
comparator = SummarizationComparator()

In [None]:
# Test on different text types
test_texts = {
    "Tech News": """
    OpenAI released GPT-4, their most advanced AI model yet. The new model 
    demonstrates improved reasoning capabilities and can process both text 
    and images. Early tests show it outperforms previous versions on various 
    benchmarks. The company is gradually rolling out access to developers 
    and businesses through their API.
    """,
    
    "Science": """
    A team of astronomers has detected water vapor in the atmosphere of a 
    planet orbiting in the habitable zone of a distant star. The exoplanet, 
    named K2-18b, is located about 110 light-years from Earth. This discovery 
    marks the first time water has been found on a potentially habitable 
    world outside our solar system. The findings were published in Nature.
    """,
}

for text_type, text in test_texts.items():
    print(f"\n{'#'*70}")
    print(f"TEXT TYPE: {text_type}")
    print(f"{'#'*70}")
    
    results = comparator.compare(text)
    print(comparator.format_comparison(text, results))

---

# Part 5: Mini-Project

## Project: Article Digest Generator

**Scenario**: You're building a content curation tool that needs to generate summaries at different lengths for different platforms.

**Your goal**: Build an `ArticleDigestGenerator` class that:
1. Takes an article as input
2. Generates summaries for tweet, paragraph, and abstract lengths
3. Includes hallucination warnings
4. Provides formatting for different platforms
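
Platform formatting usually involves a character limit, and cutting a summary at an arbitrary character can split a word in half. The sketch below shows one way to trim at a word boundary instead. `trim_to_limit` is a hypothetical helper and the `"..."` suffix is an assumed convention.

```python
def trim_to_limit(text, char_limit, ellipsis="..."):
    """Trim text to at most char_limit characters without splitting a word."""
    if char_limit <= len(ellipsis):
        raise ValueError("char_limit must be larger than the ellipsis")

    text = text.strip()
    if len(text) <= char_limit:
        return text

    budget = char_limit - len(ellipsis)
    cut = text[:budget]
    # If the cut landed inside a word, back up to the previous space
    if not text[budget].isspace() and " " in cut:
        cut = cut[:cut.rfind(" ")]

    return cut.rstrip(" ,;:") + ellipsis


print(trim_to_limit("The quick brown fox jumps", 15))
```

The result never exceeds `char_limit`, so it can be used wherever a hard platform cap applies.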

In [None]:
# MINI-PROJECT: Article Digest Generator
# ======================================

import re

class ArticleDigestGenerator:
    """
    Generates multi-format digests from articles.
    """
    
    def __init__(self):
        """Initialize the digest generator."""
        self.summarizer = pipeline("summarization")
        
        # Length configurations for different formats
        self.formats = {
            'tweet': {'min': 10, 'max': 35, 'char_limit': 280},
            'paragraph': {'min': 30, 'max': 80, 'char_limit': 500},
            'abstract': {'min': 60, 'max': 150, 'char_limit': 1000},
        }
    
    def generate_digest(self, article):
        """
        Generate digests in all formats.
        
        Args:
            article: The full article text
            
        Returns:
            dict with digests for each format
        """
        digests = {}
        original_word_count = len(article.split())
        
        for format_name, config in self.formats.items():
            # Generate summary (inputs longer than the model's context window are truncated)
            result = self.summarizer(
                article,
                min_length=config['min'],
                max_length=config['max'],
                do_sample=False,
                num_beams=4,
                truncation=True,
            )
            
            summary = result[0]['summary_text']
            
            # Trim to character limit if needed
            if len(summary) > config['char_limit']:
                summary = summary[:config['char_limit']-3] + "..."
            
            # Check for potential hallucinations
            warnings = self._check_hallucinations(article, summary)
            
            digests[format_name] = {
                'text': summary,
                'word_count': len(summary.split()),
                'char_count': len(summary),
                # Guard against division by zero if the summary comes back empty
                'compression': original_word_count / max(len(summary.split()), 1),
                'warnings': warnings,
            }
        
        return digests
    
    def _check_hallucinations(self, original, summary):
        """
        Check for potential hallucinations in the summary.
        """
        warnings = []
        
        # Check for new numbers
        summary_nums = set(re.findall(r'\b\d+(?:\.\d+)?%?\b', summary))
        original_nums = set(re.findall(r'\b\d+(?:\.\d+)?%?\b', original))
        new_nums = summary_nums - original_nums
        
        if new_nums:
            warnings.append(f"New numbers not in original: {new_nums}")
        
        return warnings
    
    def format_for_twitter(self, digest):
        """
        Format digest for Twitter posting.
        """
        text = digest['tweet']['text']
        chars = digest['tweet']['char_count']
        
        lines = []
        lines.append("Twitter Post Preview")
        lines.append("-" * 40)
        lines.append(text)
        lines.append("-" * 40)
        lines.append(f"Characters: {chars}/280")
        
        if digest['tweet']['warnings']:
            lines.append(f"⚠️ Warnings: {digest['tweet']['warnings']}")
        
        return '\n'.join(lines)
    
    def format_for_newsletter(self, digest):
        """
        Format digest for email newsletter.
        """
        lines = []
        lines.append("📰 NEWSLETTER DIGEST")
        lines.append("=" * 50)
        lines.append("")
        lines.append("📝 TLDR:")
        lines.append(digest['tweet']['text'])
        lines.append("")
        lines.append("📖 SUMMARY:")
        lines.append(digest['paragraph']['text'])
        lines.append("")
        lines.append("📚 DETAILED:")
        lines.append(digest['abstract']['text'])
        
        return '\n'.join(lines)
    
    def get_stats(self, digest):
        """
        Get statistics about the generated digests.
        """
        lines = []
        lines.append("Digest Statistics")
        lines.append("=" * 50)
        
        for format_name, data in digest.items():
            lines.append(f"\n[{format_name.upper()}]")
            lines.append(f"  Words: {data['word_count']}")
            lines.append(f"  Characters: {data['char_count']}")
            lines.append(f"  Compression: {data['compression']:.1f}x")
            
            if data['warnings']:
                lines.append(f"  ⚠️ Warnings: {len(data['warnings'])}")
        
        return '\n'.join(lines)


# Create the generator
generator = ArticleDigestGenerator()

In [None]:
# Test with a sample article
sample_article = """
Microsoft announced a major investment in artificial intelligence company OpenAI, 
committing billions of dollars to accelerate AI research and development. The 
partnership will integrate OpenAI's technology into Microsoft's cloud computing 
platform Azure, making advanced AI capabilities available to enterprise customers.

CEO Satya Nadella emphasized that this investment aligns with Microsoft's long-term 
vision of democratizing AI technology. The company plans to use OpenAI's models 
to enhance products across its portfolio, including Office, Bing, and GitHub Copilot.

Industry analysts view this as a strategic move to compete with Google and Amazon 
in the rapidly evolving AI market. The deal gives Microsoft exclusive access to 
OpenAI's most advanced models while providing OpenAI with the computing resources 
needed to train increasingly powerful AI systems.

OpenAI, known for creating ChatGPT and DALL-E, has emerged as a leader in 
generative AI technology. The investment comes at a time when demand for AI 
services is surging across industries, from healthcare to finance to entertainment.
"""

# Generate digests
digest = generator.generate_digest(sample_article)

# Display all formats
print(generator.get_stats(digest))

In [None]:
# Show Twitter format
print("\n")
print(generator.format_for_twitter(digest))

In [None]:
# Show newsletter format
print("\n")
print(generator.format_for_newsletter(digest))

In [None]:
# Try with your own article
# Uncomment to use:

# your_article = """
# Paste your article here...
# """
# your_digest = generator.generate_digest(your_article)
# print(generator.format_for_newsletter(your_digest))

### Extension Ideas

If you want to extend this project further:

1. **Keyword extraction**: Add keywords/hashtags to the tweet format
2. **Sentiment analysis**: Detect and report the article's sentiment
3. **Topic classification**: Categorize the article by topic
4. **Multi-article summaries**: Summarize multiple related articles together
5. **Custom formats**: Add formats for LinkedIn, email subjects, etc.
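
As a starting point for the first idea, hashtags can be approximated by picking the most frequent non-trivial words in the article. This is a deliberately naive sketch: `suggest_hashtags` and the small `STOPWORDS` set are hypothetical, and a real keyword extractor would use a fuller stopword list or a dedicated model.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; extend for real use)
STOPWORDS = {"that", "this", "with", "from", "will", "have", "their", "while", "which"}


def suggest_hashtags(text, top_n=3, min_len=4):
    """Return the top_n most frequent words as hashtags."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    counts = Counter(
        w for w in words
        if len(w) >= min_len and w not in STOPWORDS
    )
    return ["#" + word for word, _ in counts.most_common(top_n)]


demo = "Battery research: the battery charges fast. Battery makers cheer."
print(suggest_hashtags(demo, top_n=1))
```

The hashtags could be appended to the tweet digest as long as the combined text still fits the 280-character limit.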

---

# Part 6: Wrap-Up

## Key Takeaways

1. **Abstractive summarization** generates new text to summarize content, unlike extractive summarization, which selects existing sentences

2. **Encoder-decoder architecture** uses separate components for understanding input and generating output

3. **Generation parameters** like beam search, temperature, and repetition penalty significantly affect output quality

4. **Hallucination** is a real risk - models can generate plausible but factually incorrect information

5. **Length control** via min_length and max_length allows generating summaries for different use cases

## Common Mistakes to Avoid

| Mistake | Why It's a Problem |
|---------|-------------------|
| Using sampling without temperature control | Can produce random or incoherent output |
| Not checking for hallucinations | May publish factually incorrect summaries |
| Setting max_length too short | Cuts off important information |
| Using news-trained models on technical text | Poor quality summaries for specialized content |

## What's Next?

In **Notebook 5: Text Generation**, you'll learn:
- How to generate open-ended text continuations
- Different decoding strategies for creative control
- The creativity-coherence tradeoff in language models

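This builds on summarization: both tasks generate text token by token using the same decoding strategies. The summarizers in this notebook are encoder-decoder models, while open-ended generation typically uses decoder-only models, and the focus shifts from compression to creativity.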

---

## Solutions

### Check Your Understanding (Quiz Answers)

1. **B) Abstractive summarization** - Generates new text rather than selecting existing sentences
2. **C) Encoder-Decoder (like BART, T5)** - Encoder understands input, decoder generates output
3. **B) When the model generates plausible but incorrect information** - A significant risk in abstractive summarization
4. **B) Explores multiple candidate sequences and keeps the best ones** - More thorough than greedy decoding
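
To make answer 4 concrete, here is a toy beam search over a hand-written next-token table. Everything in it is made up for illustration: `TOY_MODEL` is a hypothetical stand-in for a real model's probabilities, and the sketch skips length normalization and other details that real implementations include.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
TOY_MODEL = {
    "<s>": {"the": 0.5, "a": 0.4, "</s>": 0.1},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(model, num_beams=2, max_steps=4):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)

    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))

        # Prune to the best num_beams candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break

    return beams


for k in (1, 2):
    tokens, score = beam_search(TOY_MODEL, num_beams=k)[0]
    print(f"num_beams={k}: {' '.join(tokens)} (prob={math.exp(score):.2f})")
```

With `num_beams=1` (greedy decoding) the search commits to "the" because it is the likeliest first token and ends at probability 0.20. With `num_beams=2` it also keeps "a" alive and finds the higher-probability sequence "a cat" at 0.36.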

### Exercise 2: Parameter Tuning (Key Insights)

In [None]:
# Key insights from parameter tuning:

# 1. Beam search (num_beams > 1) generally produces more coherent summaries
#    but takes longer to generate

# 2. Higher length_penalty values encourage longer summaries
#    (the setting only has an effect with beam search, i.e. num_beams > 1)

# 3. Sampling with low temperature (0.3-0.7) can add variety while
#    maintaining quality

# 4. For factual content, avoid high temperature (>1.0) to reduce
#    hallucination risk

# Best practices:
best_params_factual = {
    "num_beams": 4,
    "do_sample": False,
    "length_penalty": 1.0,
    "repetition_penalty": 1.2,
}

best_params_creative = {
    "num_beams": 1,
    "do_sample": True,
    "temperature": 0.7,
    "top_k": 50,
    "top_p": 0.9,
}

print("Recommended parameters for factual summaries:")
for k, v in best_params_factual.items():
    print(f"  {k}: {v}")

print("\nRecommended parameters for creative summaries:")
for k, v in best_params_creative.items():
    print(f"  {k}: {v}")
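
Why a higher `length_penalty` favors longer summaries is easier to see with numbers. The sketch below assumes that finished beams are ranked by the sum of token log-probabilities divided by `length ** length_penalty`, which matches my understanding of the Hugging Face beam scorer; treat the exact formula as an assumption and check the documentation for your version. The two beams and their scores are invented for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalized score: sum of log-probs / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished beams: (sum of token log-probabilities, length in tokens)
short_beam = (-2.0, 4)
long_beam = (-3.0, 8)

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_beam, length_penalty=penalty)
    long_score = beam_score(*long_beam, length_penalty=penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.3f}, "
          f"long={long_score:.3f} -> {winner} wins")
```

Because log-probabilities are negative, dividing by a larger power of the length pulls long sequences closer to zero. With these numbers the short beam wins at 0.0, and the long beam wins at 1.0 and 2.0.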

---

## Additional Resources

- [Hugging Face Summarization Docs](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.SummarizationPipeline)
- [BART Paper](https://arxiv.org/abs/1910.13461) - Denoising Sequence-to-Sequence Pre-training
- [T5 Paper](https://arxiv.org/abs/1910.10683) - Text-to-Text Transfer Transformer
- [Pegasus Paper](https://arxiv.org/abs/1912.08777) - Pre-training with Extracted Gap-sentences for Abstractive Summarization
- [Hallucination in Summarization](https://arxiv.org/abs/2005.00661) - Understanding and mitigating hallucinations