# 🚀 Mini Transformer Summarizer Quickstart
Build a minimal, original‐Transformer‑based abstractive summarization pipeline in six clear steps: setup, data prep, model, training, evaluation, and inference. Focus on simplicity—get a working end‑to‑end system before diving into bells and whistles.

## 1. Environment & Dependencies
Install PyTorch (or TensorFlow) and tokenization utilities.

Why? You need a framework to define the Transformer layers and run training, and a tokenizer to map text ↔ tokens.


In [None]:
"""
!pip install torch torchvision transformers sentencepiece
!pip install torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
!pip install xformers

"""
!pip install torch transformers sentencepiece datasets

## 2. Data Preparation
1. **Pick a dataset** (e.g. CNN/DailyMail or any small collection of articles + summaries).

2. **Clean & split into train / val / test.**

3. **Tokenize both source (article) and target (summary) with a shared vocabulary:**

   - Use SentencePiece or Hugging Face’s `PreTrainedTokenizerFast`.

   - Add special tokens: `<s>`, `</s>`, `<pad>`.
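
The cleaning and splitting in step 2 can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the notebook's method, since XSum ships with predefined splits. The helper name `split_pairs`, the 80/10/10 ratios, and the fixed seed are all assumptions chosen for the example.

```python
import random

def split_pairs(pairs, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (article, summary) pairs and split them into train/val/test.

    Hypothetical helper: the ratios and seed are illustrative defaults.
    """
    # Basic cleaning: strip whitespace and drop pairs with an empty side
    cleaned = [(a.strip(), s.strip()) for a, s in pairs if a.strip() and s.strip()]

    # Shuffle a copy with a seeded RNG so the split is reproducible
    rng = random.Random(seed)
    shuffled = cleaned[:]
    rng.shuffle(shuffled)

    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# Toy data standing in for a real corpus (one pair has an empty article)
toy_pairs = [(f"article {i} ", f"summary {i}") for i in range(20)] + [("", "orphan summary")]
train, val, test = split_pairs(toy_pairs)
print(len(train), len(val), len(test))  # 16 2 2
```

Splitting once, before any tokenization or training, keeps validation and test examples out of the training data.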

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer

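# Load dataset (using xsum as example)
# Note: some `datasets` versions prompt before running the xsum loading script;
# passing `trust_remote_code=True` to `load_dataset` avoids that prompt
dataset = load_dataset("xsum")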
train_data = dataset["train"]
val_data = dataset["validation"]
test_data = dataset["test"]

# Initialize tokenizer (shared vocab for source & target)
tokenizer = AutoTokenizer.from_pretrained("t5-small", model_max_length=512)

# Tokenization function
def tokenize_function(examples):
    # Tokenize the articles (encoder input)
    model_inputs = tokenizer(
        examples["document"],
        max_length=512,
        truncation=True,
        padding="max_length"
    )
    # Tokenize the summaries (decoder targets); `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=128,
        truncation=True,
        padding="max_length"
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply tokenization
train_dataset = train_data.map(tokenize_function, batched=True)
val_dataset = val_data.map(tokenize_function, batched=True)


## 3. Model Definition
Re‑implement (or borrow) the original Vaswani et al. Transformer:

- **Encoder:** stack of N layers, each with Multi‑Head Self‑Attention + Feed‑Forward

- **Decoder:** stack of N layers, each with Masked Self‑Attention + Encoder‑Decoder Attention + Feed‑Forward

- **Positional Encoding:** to inject token order information
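
To make the decoder's masked self-attention concrete, here is a minimal single-head sketch in plain Python with no framework. The function name `causal_attention` and the toy vectors are assumptions for illustration. A real implementation would use batched tensor operations, learned projections, and multiple heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of equal-length vectors (one per position).
    Position i may only attend to positions 0..i.
    Returns (outputs, weights).
    """
    d_k = len(k[0])
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        # Scores against visible positions only; future positions are masked out
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Masked (future) positions get weight 0
        w = softmax(scores) + [0.0] * (len(k) - i - 1)
        # Weighted sum of value vectors
        out = [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy input: three positions, two dimensions, used as q, k, and v (self-attention)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(x, x, x)
for row in weights:
    print([round(p, 3) for p in row])
```

Encoder self-attention is the same computation without the mask. Encoder-decoder attention takes `q` from the decoder and `k`, `v` from the encoder output.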

In [3]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")


In [None]:
"""

import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.d_model = d_model

        # Linear projections for query, key, value
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)

        # Final linear layer
        self.out = nn.Linear(d_model, d_model)

        # Dropout for attention
        self.dropout = nn.Dropout(0.1)

    def forward(self, q, k, v, mask=None):
        bs = q.size(0)

        # Linear projections and split into heads
        q = self.q_linear(q).view(bs, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.k_linear(k).view(bs, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.v_linear(v).view(bs, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled Dot-Product Attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            # Apply mask (broadcastable to [batch, num_heads, seq_len, seq_len])
            scores = scores.masked_fill(mask == 0, -1e9)

        attn = torch.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        out = torch.matmul(attn, v)

        # Concatenate heads and project
        out = out.transpose(1, 2).contiguous().view(bs, -1, self.d_model)
        return self.out(out)

class PositionwiseFeedForward(nn.Module):
    # Name and signature match the calls in EncoderLayer and DecoderLayer
    def __init__(self, d_model, d_ff=2048, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(torch.relu(self.linear1(x))))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_output))

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, tgt_mask, src_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        return self.norm3(x + self.dropout(ff_output))

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=512):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position.float() * div_term)
        pe[:, 1::2] = torch.cos(position.float() * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(1)]
        return self.dropout(x)

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, num_heads=8,
                 num_layers=6, d_ff=2048, dropout=0.1):
        super().__init__()
        self.encoder_embed = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, dropout)

        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        self.fc_out = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, tgt, src_mask, tgt_mask):
        src = self.dropout(self.pos_encoding(self.encoder_embed(src)))
        tgt = self.dropout(self.pos_encoding(self.decoder_embed(tgt)))

        for layer in self.encoder_layers:
            src = layer(src, src_mask)

        for layer in self.decoder_layers:
            tgt = layer(tgt, src, tgt_mask, src_mask)

        return self.fc_out(tgt)

"""

In [None]:
"""
train_dataset.set_format(type='torch', columns=['input_ids', 'labels'])
val_dataset.set_format(type='torch', columns=['input_ids', 'labels'])

"""

## 4. Training Loop
1. **Prepare batches** of `(src, tgt_input, tgt_output)`, where `tgt_input` is the summary shifted right by one position and `tgt_output` is the next‑token target.

2. **Compute logits:** `logits = model(src_tokens, tgt_input_tokens)`

3. **Loss:** Cross‑entropy (ignore `<pad>` tokens)

4. **Optimize** with Adam and the “Noam” learning‑rate schedule
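
Steps 1 and 4 can be illustrated without any framework. In this sketch the helper names (`shift_for_teacher_forcing`, `noam_rate`), the toy token IDs, and the use of ID 1 as a start symbol are assumptions. The schedule follows the formula from the original paper, lr = d_model^(-0.5) · min(step^(-0.5), step · warmup^(-1.5)), with an optional scaling `factor`.

```python
def shift_for_teacher_forcing(summary_ids, bos_id):
    """Build decoder input/target from one tokenized summary.

    tgt_input  = [bos] + summary[:-1]   (summary shifted right)
    tgt_output = summary                (next-token targets)
    """
    tgt_input = [bos_id] + summary_ids[:-1]
    tgt_output = list(summary_ids)
    return tgt_input, tgt_output

def noam_rate(step, d_model=512, warmup=4000, factor=1.0):
    """Learning rate: linear warmup, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Toy example: token IDs are made up; 1 stands in for a start-of-sequence ID
tgt_input, tgt_output = shift_for_teacher_forcing([17, 42, 8, 2], bos_id=1)
print(tgt_input)   # [1, 17, 42, 8]
print(tgt_output)  # [17, 42, 8, 2]

# The rate rises until `warmup` steps, peaks there, then decays
for s in (1, 1000, 4000, 16000):
    print(s, f"{noam_rate(s):.6f}")
```

At each position the decoder sees only the tokens before it and is trained to predict the token at that position, which is why the input and target differ by a one-step shift.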

In [4]:
import wandb
wandb.login()

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    predict_with_generate=True,
    #evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=3,
    logging_dir="./logs",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)

trainer.train()

Step,Training Loss
500,1.5295
1000,0.6975
1500,0.7
2000,0.6833
2500,0.6668
3000,0.6717
3500,0.673
4000,0.6646
4500,0.6623
5000,0.665


KeyboardInterrupt: 

In [None]:
"""

import torch.optim as optim
from torch.utils.data import DataLoader

def prepare_batch(batch):
    """Convert tokenized batch to tensors and create shifted targets"""
    src = batch["input_ids"]
    tgt = batch["labels"]

    # Create shifted right target input
    tgt_input = tgt[:, :-1]
    tgt_output = tgt[:, 1:]

    # Create masks - ensure proper dimensions and type
    src_mask = (src != tokenizer.pad_token_id).unsqueeze(1).unsqueeze(2)  # [batch, 1, 1, seq_len]
    tgt_mask = (tgt_input != tokenizer.pad_token_id).unsqueeze(1)  # [batch, 1, seq_len]

    # Causal mask - convert to bool before operation
    seq_len = tgt_input.size(1)
    causal_mask = torch.ones(1, seq_len, seq_len, device=src.device).tril().bool()  # [1, seq_len, seq_len]
    tgt_mask = tgt_mask.unsqueeze(2) & causal_mask  # [batch, 1, seq_len, seq_len]

    return src, tgt_input, tgt_output, src_mask, tgt_mask

class NoamOpt:
    """Noam learning rate schedule from original paper"""
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        if step is None:
            step = self._step
        return self.factor * (
            self.model_size ** (-0.5) *
            min(step ** (-0.5), step * self.warmup ** (-1.5))
        )

def train(model, train_dataset, val_dataset, tokenizer, epochs=10, batch_size=32):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Initialize optimizer and scheduler
    optimizer = optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9)
    scheduler = NoamOpt(
        model_size=model.encoder_embed.embedding_dim,  # Match the model's d_model instead of a hard-coded 512
        factor=2,
        warmup=4000,
        optimizer=optimizer
    )

    criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

    for epoch in range(epochs):
        model.train()
        total_loss = 0

        # Create DataLoader
        train_dataset.set_format(type='torch', columns=['input_ids', 'labels'])
        train_loader = DataLoader(
            train_dataset,
            batch_size=batch_size,
            shuffle=True,
            num_workers=4,  # Parallel loading
            pin_memory=True if torch.cuda.is_available() else False
        )

        for batch in train_loader:
            src, tgt_input, tgt_output, src_mask, tgt_mask = prepare_batch(batch)
            src = src.to(device)
            tgt_input = tgt_input.to(device)
            tgt_output = tgt_output.to(device)
            src_mask = src_mask.to(device)
            tgt_mask = tgt_mask.to(device)

            optimizer.zero_grad()

            # Forward pass
            logits = model(src, tgt_input, src_mask, tgt_mask)

            # Reshape for loss calculation
            loss = criterion(
                logits.view(-1, logits.size(-1)),
                tgt_output.contiguous().view(-1)
            )

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            scheduler.step()
            total_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)

        # Validation
        val_dataset.set_format(type='torch', columns=['input_ids', 'labels'])
        val_loss = evaluate(model, val_dataset, criterion, device, tokenizer)

        print(f"Epoch {epoch+1}: Train Loss = {avg_train_loss:.3f}, Val Loss = {val_loss:.3f}, LR = {scheduler._rate:.6f}")

def evaluate(model, val_dataset, criterion, device, tokenizer):
    model.eval()
    total_loss = 0
    val_loader = DataLoader(val_dataset, batch_size=32)

    with torch.no_grad():
        for batch in val_loader:
            src, tgt_input, tgt_output, src_mask, tgt_mask = prepare_batch(batch)
            src = src.to(device)
            tgt_input = tgt_input.to(device)
            tgt_output = tgt_output.to(device)
            src_mask = src_mask.to(device)
            tgt_mask = tgt_mask.to(device)

            logits = model(src, tgt_input, src_mask, tgt_mask)
            loss = criterion(
                logits.view(-1, logits.size(-1)),
                tgt_output.contiguous().view(-1)
            )
            total_loss += loss.item()

    return total_loss / len(val_loader)

# Initialize model
# 1. Verify tokenizer
print(f"Tokenizer vocab size: {tokenizer.vocab_size}")

# 2. Initialize model with debug
print("\nInitializing model...")
# Reduce model size if needed:
model = Transformer(
    src_vocab_size=tokenizer.vocab_size,
    tgt_vocab_size=tokenizer.vocab_size,
    d_model=256,  # Reduced from 512
    num_heads=4,  # Reduced from 8
    num_layers=3  # Reduced from 6
)

# 3. Verify parameters
params = list(model.parameters())
print(f"\nModel has {len(params)} parameter groups")
print(f"Total parameters: {sum(p.numel() for p in params):,}")

if len(params) == 0:
    print("\nERROR: No parameters found! Check:")
    print("- All submodules inherit from nn.Module")
    print("- All layers are properly initialized")
    print("- No parameters are being accidentally deleted")
else:
    print("\nModel initialized successfully. Starting training...")
    train(model, train_dataset, val_dataset, tokenizer, epochs=1)  # Test with 1 epoch first


"""

Tokenizer vocab size: 32100

Initializing model...

Model has 130 parameter groups
Total parameters: 33,366,372

Model initialized successfully. Starting training...


## 5. Evaluation
- **Automatic:** compute ROUGE‑1/2/L on the validation set every epoch.

- **Manual spot‑checks:** read a few source-summary pairs.
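
For intuition about what the automatic metric measures, ROUGE-1 can be sketched as unigram overlap between prediction and reference. This simplified version (whitespace tokenization, lowercasing, no stemming) is an illustration only and will not exactly match scores from the `rouge_score` package. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(reference, prediction):
    """Unigram-overlap F1 between a reference and a predicted summary."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())

    # Clipped overlap: each unigram counts at most as often as it appears in both
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "nasa finds a new exoplanet"
prediction = "nasa discovers a new planet"
print(round(rouge1_f1(reference, prediction), 3))  # 0.6
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of fixed-size n-grams.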

In [8]:
!pip install evaluate
!pip install rouge_score
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

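def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Some models return a tuple; the generated token IDs come first
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    # Replace -100 (ignored positions) with the pad token ID so decoding does not fail
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Report scores as percentages
    return {k: round(v * 100, 4) for k, v in result.items()}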

trainer.compute_metrics = compute_metrics
eval_results = trainer.evaluate()
print(eval_results)

In [None]:
"""

import numpy as np
from rouge_score import rouge_scorer
from random import sample

def evaluate_rouge(model, val_data, tokenizer, device, num_samples=None):
    """Compute ROUGE scores on validation set"""
    model.eval()
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    if num_samples:
        val_data = sample(list(val_data), num_samples)

    all_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    with torch.no_grad():
        for batch in val_data:
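            # Prepare input
            src = torch.tensor(batch["input_ids"]).unsqueeze(0).to(device)
            # Attention mask: 1 for real tokens, 0 for padding, shape [1, seq_len]
            src_mask = (src != tokenizer.pad_token_id).long()

            # Generate summary; assumes a Hugging Face seq2seq model, since the
            # from-scratch Transformer class defines no generate() method
            summary_ids = model.generate(
                input_ids=src,
                attention_mask=src_mask,
                max_length=150,
                num_beams=4,
                early_stopping=True
            )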

            # Decode texts
            source_text = tokenizer.decode(batch["input_ids"], skip_special_tokens=True)
            reference = tokenizer.decode(batch["labels"], skip_special_tokens=True)
            prediction = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

            # Compute ROUGE
            scores = scorer.score(reference, prediction)
            for key in all_scores:
                all_scores[key].append(scores[key].fmeasure)

    # Aggregate results
    avg_scores = {k: np.mean(v) for k, v in all_scores.items()}
    return avg_scores

def manual_spot_check(model, val_data, tokenizer, device, num_samples=3):
    """Print human-readable source-reference-prediction triples"""
    model.eval()
    samples = sample(list(val_data), num_samples)

    print("\nManual Spot Checks:")
    print("="*50)

    with torch.no_grad():
        for i, batch in enumerate(samples):
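            src = torch.tensor(batch["input_ids"]).unsqueeze(0).to(device)
            # Attention mask: 1 for real tokens, 0 for padding, shape [1, seq_len]
            src_mask = (src != tokenizer.pad_token_id).long()

            # Assumes a Hugging Face seq2seq model that provides generate()
            summary_ids = model.generate(
                input_ids=src,
                attention_mask=src_mask,
                max_length=150,
                num_beams=4,
                early_stopping=True
            )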

            source_text = tokenizer.decode(batch["input_ids"], skip_special_tokens=True)
            reference = tokenizer.decode(batch["labels"], skip_special_tokens=True)
            prediction = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

            print(f"\nSample {i+1}:")
            print(f"\nSOURCE:\n{source_text[:500]}...")
            print(f"\nREFERENCE:\n{reference}")
            print(f"\nPREDICTION:\n{prediction}")
            print("-"*50)

# Integrated evaluation during training
def train_with_evaluation(model, train_data, val_data, tokenizer, epochs=10, batch_size=32):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    optimizer = optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9)
    scheduler = NoamOpt(512, 2, 4000, optimizer)
    criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

    for epoch in range(epochs):
        # Training phase (same as before)
        model.train()
        train_loss = train_epoch(model, train_data, optimizer, scheduler, criterion, device, batch_size)

        # Evaluation phase
        val_loss = evaluate_loss(model, val_data, criterion, device)
        rouge_scores = evaluate_rouge(model, val_data, tokenizer, device, num_samples=100)

        print(f"\nEpoch {epoch+1}:")
        print(f"Train Loss: {train_loss:.3f} | Val Loss: {val_loss:.3f}")
        print(f"ROUGE-1: {rouge_scores['rouge1']:.3f} | ROUGE-2: {rouge_scores['rouge2']:.3f} | ROUGE-L: {rouge_scores['rougeL']:.3f}")

        # Show samples every 2 epochs
        if (epoch + 1) % 2 == 0:
            manual_spot_check(model, val_data, tokenizer, device)

# Helper functions (extracted from previous training code)
def train_epoch(model, train_data, optimizer, scheduler, criterion, device, batch_size):
    # ... same training loop as before ...
    return total_loss / len(train_loader)

def evaluate_loss(model, val_data, criterion, device):
    # ... same loss evaluation as before ...
    return total_loss / len(val_loader)

# Example usage
train_with_evaluation(model, train_dataset, val_dataset, tokenizer)

"""

## 6. Inference
1. **Greedy / Beam Search over decoder output:**


```python
summary_ids = model.generate(
    src_tokens,
    max_length=128,
    num_beams=4,
    early_stopping=True,
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

2. **Post‑process:** strip extra spaces, ensure punctuation.
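
Greedy decoding, the simplest strategy named in step 1, picks the highest-scoring token at each step until an end-of-sequence token appears. The sketch below uses a hypothetical `next_token_scores` callback in place of a real decoder forward pass, and the token IDs are invented for the toy example. Beam search generalizes this by keeping the `num_beams` best partial sequences instead of one.

```python
def greedy_decode(next_token_scores, bos_id, eos_id, max_length=128):
    """Generate token IDs by repeatedly taking the argmax of the next-token scores.

    next_token_scores(prefix) -> list of scores, one per vocabulary ID
    (a stand-in for running the decoder on the tokens generated so far).
    """
    tokens = [bos_id]
    for _ in range(max_length):
        scores = next_token_scores(tokens)
        # Index of the highest score = most likely next token
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == eos_id:
            break
    return tokens[1:]  # drop the start token

# Toy "model": vocabulary of 5 IDs; always prefers (last token + 1), capped at 4
def toy_scores(prefix):
    target = min(prefix[-1] + 1, 4)
    return [1.0 if i == target else 0.0 for i in range(5)]

print(greedy_decode(toy_scores, bos_id=0, eos_id=4))  # [1, 2, 3, 4]
```

The `max_length` cap guarantees termination even if the end-of-sequence token is never produced.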



In [12]:
import torch
def generate_summary(text):
    # Move model to device if not already there
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Tokenize and move inputs to same device as model
    inputs = tokenizer(
        text,
        max_length=512,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    ).to(device)  # Inputs must be on the same device as the model

    # Generate summary
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=150,
        num_beams=4,
        early_stopping=True
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test with sample text
sample_article = "NASA announced the discovery of a new exoplanet orbiting a distant star..."
print(generate_summary(sample_article))

NASA has announced the discovery of an exoplanet orbiting a distant star.


In [None]:
"""

def generate_summary(
    model,
    source_text,
    tokenizer,
    device,
    max_length=128,
    num_beams=4,
    do_sample=False,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    no_repeat_ngram_size=0
):
    """
    Generate summary with options for different decoding strategies
    Args:
        model: Loaded Transformer model
        source_text: Raw input text to summarize
        tokenizer: Pre-trained tokenizer
        device: cuda/cpu
        max_length: Maximum summary length
        num_beams: Beam width (1=greedy)
        do_sample: Use sampling instead of beam search
        temperature: Softmax temperature
        top_k: Top-k sampling
        top_p: Nucleus sampling
        repetition_penalty: Penalize repeated tokens (>1.0)
        no_repeat_ngram_size: Block repeating n-grams
    Returns:
        Post-processed summary text
    """
    # Tokenize input
    inputs = tokenizer(
        source_text,
        max_length=512,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    ).to(device)

    # Generate summary tokens
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_beams=num_beams,
        early_stopping=True,
        do_sample=do_sample,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        no_repeat_ngram_size=no_repeat_ngram_size
    )

    # Decode and post-process
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return postprocess_summary(summary)

def postprocess_summary(text):
    """Clean up generated summary"""
    # Fix whitespace
    text = " ".join(text.split())

    # Ensure proper punctuation
    if text and text[-1] not in {".", "!", "?"}:
        text = text.rstrip() + "."

    # Capitalize first letter
    if text:
        text = text[0].upper() + text[1:]

    return text

# Example usage
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sample_text = """
NASA announced today the discovery of three new exoplanets orbiting a distant star
in the constellation Lyra. The planets, detected by the Kepler space telescope,
appear to be in the habitable zone where liquid water could exist. Scientists say...
"""

# Greedy decoding
greedy_summary = generate_summary(
    model,
    sample_text,
    tokenizer,
    device,
    num_beams=1
)

# Beam search
beam_summary = generate_summary(
    model,
    sample_text,
    tokenizer,
    device,
    num_beams=4
)

# Beam search with sampling (not diverse beam search, which uses `num_beam_groups`)
diverse_summary = generate_summary(
    model,
    sample_text,
    tokenizer,
    device,
    num_beams=4,
    do_sample=True,
    temperature=0.7,
    top_k=50
)

print("Greedy:", greedy_summary)
print("Beam:", beam_summary)
print("Diverse:", diverse_summary)

"""