# Text Summarization with BART (Hugging Face Transformers)

This notebook fine-tunes a pretrained **BART model** (`facebook/bart-base`) from Hugging Face with PyTorch Lightning to perform **abstractive text summarization**.  

**Key steps:**
1. Load the dataset (`train.json` and `test.json`) of dialogue-summary pairs.  
2. Tokenize the input dialogues and reference summaries.  
3. Fine-tune `facebook/bart-base` with a PyTorch Lightning training loop.  
4. Generate summaries for sample texts.  
5. Compare generated summaries with references qualitatively.  


In [13]:
#Import standard Libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import textwrap
from tqdm import tqdm

#Import PyTorch and related libraries
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam, AdamW
from torch.optim.lr_scheduler import LambdaLR

- Install `pytorch_lightning` if it is not already installed.
- Run: `!pip install pytorch_lightning`

In [14]:
#Import Hugging Face Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    get_linear_schedule_with_warmup
)

#Import PyTorch Lightning for simplified training
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger


In [15]:
#Set random seeds for reproducibility
pl.seed_everything(42)

#Constants
MAX_LEN = 150
SUMMARY_LEN = 50
BATCH_SIZE = 32
LEARNING_RATE = 3e-5
EPOCHS = 5
MODEL_NAME = "facebook/bart-base"  #Using BART which is good for summarization

INFO:lightning_fabric.utilities.seed:Seed set to 42


## Dataset

- Source: DeepLearning.AI  
- Format: JSON (`train.json` and `test.json`) with fields:
  - `dialogue`: original conversation to be summarized  
  - `summary`: human-written reference summary  
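
Before training, it can help to confirm that every record really carries both fields. The following is a minimal sketch using only the standard library. It assumes the files hold a JSON array of objects, which the notebook does not state explicitly; `sample_json`, `REQUIRED_FIELDS`, and `validate_records` are hypothetical names introduced for illustration, and the two sample records are made up.

```python
import json

# Hypothetical inline sample standing in for the contents of train.json.
# Assumption: the file is a JSON array of objects with the two fields above.
sample_json = """
[
  {"dialogue": "A: Are you coming tonight? B: Yes, at 8.",
   "summary": "B will come tonight at 8."},
  {"dialogue": "A: Did you send the report? B: Not yet, tomorrow.",
   "summary": "B will send the report tomorrow."}
]
"""

REQUIRED_FIELDS = ("dialogue", "summary")


def validate_records(records):
    """Return indices of records missing a required, non-empty string field."""
    bad = []
    for i, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break  # one failure is enough to flag this record
    return bad


records = json.loads(sample_json)
invalid = validate_records(records)
print(f"{len(records)} records, {len(invalid)} invalid")
```

With the real files, `records` would come from `json.load(open(path))` instead of the inline sample, under the same layout assumption.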

Example:
```json
{
  "dialogue": "Hannah: Hey, do you have Betty's number? Amanda: Lemme check Hannah: <file_gif> Amanda: Sorry, can't find it. Amanda: Ask Larry Amanda: He called her last time we were at the park together Hannah: I don't know him well Hannah: <file_gif> Amanda: Don't be shy, he's very nice Hannah: If you say so.. Hannah: I'd rather you texted him Amanda: Just text him 🙂 Hannah: Urgh.. Alright Hannah: Bye Amanda: Bye bye",
  "summary": "Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry."
}
```


In [16]:
#Load and preprocess data
def load_data(data_dir):
    train_data = pd.read_json(f"{data_dir}/train.json")
    test_data = pd.read_json(f"{data_dir}/test.json")
    return train_data, test_data

#Dataset class
class SummaryDataset(Dataset):
    def __init__(self, data, tokenizer, max_len, summary_len):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.summary_len = summary_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        document = str(self.data.iloc[index]['dialogue'])
        summary = str(self.data.iloc[index]['summary'])

        inputs = self.tokenizer(
            document,
            max_length = self.max_len,
            padding = 'max_length',
            truncation = True,
            return_tensors = 'pt'
        )

        targets = self.tokenizer(
            summary,
            max_length = self.summary_len,
            padding = 'max_length',
            truncation = True,
            return_tensors = 'pt'
        )

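        #Replace padding ids in the labels with -100 so the loss ignores them
        labels = targets['input_ids'].squeeze()
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'labels': labels
        }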
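
Because every summary is padded to `summary_len`, the label sequence usually ends in a run of padding ids. A common convention with Hugging Face seq2seq models is to replace those positions with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so padding does not contribute to the training loss. Below is a minimal pure-Python sketch of that masking. `mask_padding` is a hypothetical helper, the non-special token ids are made up, and the pad id of 1 should be checked against `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
PAD_TOKEN_ID = 1     # assumed pad id for BART; verify with tokenizer.pad_token_id


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Return a copy of label_ids with padding positions set to IGNORE_INDEX."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]


# Made-up label sequence: <s>, three content tokens, </s>, then three pads.
labels = [0, 713, 16, 10, 2, 1, 1, 1]
masked = mask_padding(labels)
print(masked)  # [0, 713, 16, 10, 2, -100, -100, -100]
```

On a tensor, the same idea is a single boolean-mask assignment: `labels[labels == tokenizer.pad_token_id] = -100`.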


---


## Model: facebook/bart-base

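We use the Hugging Face `transformers` library with **BART**, a pretrained sequence-to-sequence transformer. The `facebook/bart-base` checkpoint is a general pretrained model rather than one already tuned for summarization, so this notebook fine-tunes it on the dialogue-summary pairs.

Steps:
- Load the `facebook/bart-base` tokenizer and model.  
- Wrap the model in a PyTorch Lightning module and fine-tune it.  
- Generate summaries for sample texts.  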


In [17]:
#Lightning Module for training
class SummaryModel(pl.LightningModule):
    def __init__(self, model_name = MODEL_NAME, lr = LEARNING_RATE):
        super().__init__()
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.lr = lr
        self.save_hyperparameters()

    def forward(self, input_ids, attention_mask, labels= None):
        outputs = self.model(
            input_ids = input_ids,
            attention_mask = attention_mask,
            labels = labels
        )
        return outputs

    def training_step(self, batch, batch_idx):
        outputs = self(
            batch['input_ids'],
            batch['attention_mask'],
            batch['labels']
        )
        loss = outputs.loss
        self.log('train_loss', loss, prog_bar = True, logger = True)
        return loss

    def validation_step(self, batch, batch_idx):
        outputs = self(
            batch['input_ids'],
            batch['attention_mask'],
            batch['labels']
        )
        loss = outputs.loss
        self.log('val_loss', loss, prog_bar = True, logger = True)
        return loss

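    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=self.lr)

        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps = 0,
            num_training_steps = self.trainer.estimated_stepping_batches
        )
        #Step the scheduler every batch, since num_training_steps counts batches
        return {
            'optimizer': optimizer,
            'lr_scheduler': {'scheduler': scheduler, 'interval': 'step'}
        }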
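
The scheduler configured above scales the base learning rate by a factor that rises linearly from 0 to 1 during warmup and then falls linearly to 0 at `num_training_steps`. With `num_warmup_steps = 0`, the rate starts at `LEARNING_RATE` and decays from the first step. Below is a plain-Python sketch of that factor, written from the documented behavior rather than copied from the library. `linear_lr_factor` is a hypothetical helper and `total_steps = 1000` is an arbitrary illustrative value.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: ramp linearly from 1 down to 0, never below 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5      # same value as LEARNING_RATE in the notebook
total_steps = 1000  # hypothetical; the notebook uses estimated_stepping_batches

for step in (0, 250, 500, 1000):
    print(step, base_lr * linear_lr_factor(step, 0, total_steps))
```

The factor is defined per optimizer step, so the schedule only matches this curve if the scheduler is advanced once per batch rather than once per epoch.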

#### Loading Data

In [18]:
#Load data
data_dir = "data/corpus"
train_data, test_data = load_data(data_dir)

#Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

#Create datasets
train_dataset = SummaryDataset(train_data, tokenizer, MAX_LEN, SUMMARY_LEN)
val_dataset = SummaryDataset(test_data, tokenizer, MAX_LEN, SUMMARY_LEN)

#Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size = BATCH_SIZE,
    shuffle = True,
    num_workers = 2
)

val_loader = DataLoader(
    val_dataset,
    batch_size = BATCH_SIZE,
    shuffle = False,
    num_workers = 2
)

#### Instantiating the Model

In [19]:
#Initialize model
model = SummaryModel()

#Callbacks
checkpoint_callback = ModelCheckpoint(
    monitor = 'val_loss',
    dirpath = 'checkpoints',
    filename = 'best-checkpoint',
    save_top_k = 1,
    mode = 'min'
)
logger = TensorBoardLogger("lightning_logs", name = "summarization")

#Determine accelerator and devices based on availability
if torch.cuda.is_available():
    accelerator = 'gpu'
    devices = 1
else:
    accelerator = 'cpu'
    devices = 'auto'

#Trainer
trainer = pl.Trainer(
    max_epochs = EPOCHS,
    logger = logger,
    callbacks = [checkpoint_callback],
    accelerator = accelerator,
    devices = devices
)

#Train the model
trainer.fit(model, train_loader, val_loader)

#Example summarization
example_text = train_data.iloc[0]['dialogue']
inputs = tokenizer(
    example_text,
    max_length = MAX_LEN,
    truncation = True,
    padding = 'max_length',
    return_tensors = 'pt'
)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
/usr/local/lib/python3.12/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:701: Checkpoint directory /content/checkpoints exists and is not empty.
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                         | Params | Mode
--------------------------------------------------------------
0 | model | BartForConditionalGeneration | 139 M  | eval
--------------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
557.682   Total estimated model p

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


## Evaluation

Instead of BLEU/ROUGE, we perform **qualitative evaluation** by comparing predicted summaries against human-written references.

Example:
- Input: "The stock market crashed yesterday due to global uncertainty..."  
- Predicted: "Global uncertainty caused a market crash."  
- Reference: "Stock market crashed due to uncertainty."  

Observation:
- The model captures the main idea but may paraphrase differently.  
- Summaries are concise and fluent.  
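
As a rough aid to the side-by-side reading above, a lexical-overlap score can flag summaries that share few words with the reference. The sketch below computes unigram precision, recall, and F1 with the standard library only. It is a simplified illustration in the spirit of ROUGE-1, not a replacement for a proper ROUGE implementation, and `tokenize` and `unigram_overlap` are hypothetical helpers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_overlap(prediction, reference):
    """Unigram precision, recall, and F1 between a prediction and a reference."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


predicted = "Global uncertainty caused a market crash."
reference = "Stock market crashed due to uncertainty."
scores = unigram_overlap(predicted, reference)
print(scores)
```

For the example pair, only two of the six tokens on each side match ("market" and "uncertainty"), because "crash" and "crashed" count as different words. The meaning is nearly identical, which is why a qualitative reading remains useful alongside overlap scores.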


In [20]:
#Tokenize an example text, then load the best checkpoint and summarize it
example_text = """My major priority is to validate the usage of our material for organ-on-a-chip applications, guided by market analysis and customer discovery. Given David's background in pharmacy, I am especially keen to have him identify specific pathologies, treatments, and drug R&D groups that would be a good fit, so that we can tailor our technical development pathway and funding strategy. He's already made a good start on a market opportunity analysis document and I expect that the bulk of his efforts over the next few weeks will be focused on customer validation/customer discovery. The document should have been shared with you recently- and I am happy to do the same with subsequent documents related to the project.
"""
inputs = tokenizer(
    example_text,
    max_length = MAX_LEN,
    truncation = True,
    padding = 'max_length',
    return_tensors = 'pt'
)


best_model = SummaryModel.load_from_checkpoint(
    trainer.checkpoint_callback.best_model_path
)
best_model.eval()

device = next(best_model.parameters()).device
inputs = {k: v.to(device) for k, v in inputs.items()}

#Generate summary
summary_ids = best_model.model.generate(
    input_ids = inputs['input_ids'],
    attention_mask = inputs['attention_mask'],
    max_length = SUMMARY_LEN,
    num_beams = 2,
    early_stopping = True
)

summary = tokenizer.decode(
    summary_ids[0],
    skip_special_tokens = True,
    clean_up_tokenization_spaces = True
)

print("\nOriginal Text:")
print(example_text)
print("\nGenerated Summary:")
print(summary)


Original Text:
My major priority is to validate the usage of our material for organ-on-a-chip applications, guided by market analysis and customer discovery. Given David's background in pharmacy, I am especially keen to have him identify specific pathologies, treatments, and drug R&D groups that would be a good fit, so that we can tailor our technical development pathway and funding strategy. He's already made a good start on a market opportunity analysis document and I expect that the bulk of his efforts over the next few weeks will be focused on customer validation/customer discovery. The document should have been shared with you recently- and I am happy to do the same with subsequent documents related to the project.


Generated Summary:
My team is working on a market opportunity analysis document for organ-on-a-chip applications. David will help us to identify specific pathologies, treatments, and drug R&D groups that would be a good fit for the project


## Conclusion


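- Fine-tuned `facebook/bart-base` for abstractive summarization of dialogues.
- In the sample inspected, the generated summary is fluent and concise and captures the main points of the input. No quantitative metrics (BLEU/ROUGE) were computed, so this is a qualitative observation only.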

---
Text summarization is useful in real-world applications such as:
- Condensing news articles into short briefs.
- Summarizing legal or research documents for quicker understanding.
- Providing quick insights from long customer support logs.

This project demonstrates how transformer-based models like BART can help reduce information overload by generating concise summaries.