# BLIP2 + T5 Radiology Report Generation (LoRA/PEFT Fine-tuning)
This notebook demonstrates how to use BLIP2 for image captioning and T5 for radiology report generation, with lightweight LoRA fine-tuning on a small subset of the MIMIC-CXR dataset.

In [2]:
!pip install transformers peft datasets torch evaluate scikit-learn pillow


In [1]:
import os
import random
import torch
from datasets import load_dataset
from transformers import Blip2Processor, Blip2ForConditionalGeneration, T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from evaluate import load as load_metric
from sklearn.model_selection import train_test_split
from PIL import Image



## 1. Load a Subset of the Dataset
We merge all dataset splits and draw a fixed random sample of 200 rows (seed 42) for quick experimentation.

In [2]:
from datasets import concatenate_datasets

raw_ds = load_dataset("itsanmolgupta/mimic-cxr-dataset-cleaned")
# Merge all splits into one dataset so rows can be indexed lazily
full_ds = concatenate_datasets([raw_ds[split] for split in raw_ds.keys()])
random.seed(42)
# Draw 200 distinct row indices, then load only those rows (and their images)
sample_indices = random.sample(range(len(full_ds)), 200)
sampled_rows = [full_ds[i] for i in sample_indices]
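
The seeded sampling step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the helper name `sample_indices` and the row count of 1000 are assumptions made for the example.

```python
import random

def sample_indices(n_rows, k, seed=42):
    """Return k distinct row indices drawn reproducibly from range(n_rows)."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(range(n_rows), k)

# Hypothetical dataset size of 1000 rows, sampled twice with the same seed
first = sample_indices(1000, 200)
second = sample_indices(1000, 200)
print(first == second, len(set(first)))
```

Using a fixed seed means the same 200 rows are selected on every run, which keeps the before and after fine-tuning scores comparable.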

## 2. Train/Test Split
Split the 200 rows into 160 for training and 40 for testing.

In [3]:
from sklearn.model_selection import train_test_split
train_rows, test_rows = train_test_split(sampled_rows, test_size=0.2, random_state=42)

## 3. Load BLIP2 and T5 Models
We use BLIP2 for image captioning and T5 for report generation.

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
blip2_model_path = 'Salesforce/blip2-opt-2.7b'
t5_model_path = 't5-base'
blip2_processor = Blip2Processor.from_pretrained(blip2_model_path)
blip2_model = Blip2ForConditionalGeneration.from_pretrained(blip2_model_path, torch_dtype=torch.float16 if device=='cuda' else torch.float32, device_map='auto')
t5_tokenizer = T5Tokenizer.from_pretrained(t5_model_path)
t5_model = T5ForConditionalGeneration.from_pretrained(t5_model_path).to(device)


## 4. Preprocessing: BLIP2 Caption + T5 Input
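Each image is captioned by BLIP2. The T5 input combines that caption with the reference findings and impression, and the T5 target is the findings and impression alone. Because the target text already appears in the input, T5 is trained here to restate the supplied report rather than to infer it from the image.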

In [5]:
def preprocess(example):
    image = example['image']
    if isinstance(image, str):
        image = Image.open(image).convert('RGB')
    # BLIP2: Generate caption
    blip2_inputs = blip2_processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        generated_ids = blip2_model.generate(**blip2_inputs, max_new_tokens=32)
        caption = blip2_processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    findings = example.get('findings', '')
    impression = example.get('impression', '')
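    # T5 input: BLIP2 caption followed by the reference findings and impression
    t5_input = f"Image Caption: {caption}. Findings: {findings}. Impression: {impression}."
    # T5 target: the reference report only (this text is also part of t5_input)
    t5_target = f"Findings: {findings}. Impression: {impression}."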
    t5_enc = t5_tokenizer(t5_input, return_tensors="pt", padding="max_length", truncation=True, max_length=256)
    t5_labels = t5_tokenizer(t5_target, return_tensors="pt", padding="max_length", truncation=True, max_length=128).input_ids
    return {"input_ids": t5_enc.input_ids.squeeze(0), "attention_mask": t5_enc.attention_mask.squeeze(0), "labels": t5_labels.squeeze(0)}
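
The string layout used by `preprocess` can be isolated as a small pure function. The sketch below is an illustration under assumptions: `build_t5_pair` is a hypothetical helper, and the `include_reference=False` branch is a hypothetical caption-only variant that the notebook does not use.

```python
def build_t5_pair(caption, findings, impression, include_reference=True):
    """Build (input, target) strings in the same layout as preprocess().

    include_reference=False is a hypothetical variant, not used in the
    notebook, that keeps the reference report out of the input.
    """
    target = f"Findings: {findings}. Impression: {impression}."
    if include_reference:
        source = f"Image Caption: {caption}. {target}"
    else:
        source = f"Image Caption: {caption}."
    return source, target

src, tgt = build_t5_pair("a chest x-ray", "Lungs are clear", "No acute process")
caption_only, _ = build_t5_pair(
    "a chest x-ray", "Lungs are clear", "No acute process", include_reference=False
)
print(src.endswith(tgt))             # the target is a suffix of the default input
print("Lungs" in caption_only)       # the caption-only input omits the reference text
```

With the default layout the target is contained in the input, which is worth keeping in mind when reading the BLEU and ROUGE scores later in the notebook.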

## 5. Evaluation Metrics
We use BLEU and ROUGE for report quality.

In [13]:
!pip install rouge_score



In [6]:
bleu = load_metric('bleu')
rouge = load_metric('rouge')
def compute_metrics(preds, labels):
    decoded_preds = [t5_tokenizer.decode(p, skip_special_tokens=True) for p in preds]
    decoded_labels = [t5_tokenizer.decode(l, skip_special_tokens=True) for l in labels]
    bleu_score = bleu.compute(predictions=decoded_preds, references=[[l] for l in decoded_labels])['bleu']
    rouge_score = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    return {'bleu': bleu_score, 'rougeL': rouge_score['rougeL']}
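
To make the `rougeL` number easier to interpret, here is a simplified sketch of ROUGE-L as an F1 score over the longest common subsequence (LCS) of tokens. It assumes lowercase whitespace tokenization, which is a simplification of what the `rouge` metric does, so its values will not match the library exactly. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """ROUGE-L style F1 using lowercase whitespace tokens (simplified)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("No large pneumothorax", "No large left pneumothorax")
print(round(score, 4))  # LCS = 3, precision = 1.0, recall = 0.75
```

A prediction that drops a single word from a four-word reference still scores about 0.86, so high ROUGE-L values are expected whenever the output is a near copy of the reference.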

## 6. Inference Before Fine-tuning
Generate reports for the test rows with the pretrained T5, using the BLIP2 captions, to establish a baseline before any fine-tuning.

In [7]:
def generate_and_evaluate(dataset, t5_model, t5_tokenizer):
    preds, labels = [], []
    for ex in dataset:
        inputs = preprocess(ex)
        input_ids = inputs['input_ids'].unsqueeze(0).to(device)
        attention_mask = inputs['attention_mask'].unsqueeze(0).to(device)
        with torch.no_grad():
            generated_ids = t5_model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=128)
        preds.append(generated_ids[0])
        labels.append(inputs['labels'])
    return compute_metrics(preds, labels)
print("Evaluating before fine-tuning...")
pre_ft_metrics = generate_and_evaluate(test_rows, t5_model, t5_tokenizer)
print("Before fine-tuning:", pre_ft_metrics)



Evaluating before fine-tuning...




Before fine-tuning: {'bleu': 0.1524337503602974, 'rougeL': 0.36631172897810793}


## 7. LoRA/PEFT Fine-tuning (T5 Only)
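We fine-tune only T5. Rank-8 LoRA adapters are attached to the attention query (`q`) and value (`v`) projections, and the base T5 weights stay frozen, so only a small number of parameters are trained.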

In [8]:
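# Rank-8 adapters on the attention query ('q') and value ('v') projections
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=['q', 'v'], lora_dropout=0.05, bias='none', task_type='SEQ_2_SEQ_LM')
# Note: t5_model is not loaded in k-bit precision here, so this call is likely unnecessary; it is kept as in the original run
t5_model = prepare_model_for_kbit_training(t5_model)
# Wrap T5 so that only the LoRA adapter weights are trainable
t5_model = get_peft_model(t5_model, lora_config)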
def collate_fn(batch):
    input_ids = torch.stack([b['input_ids'] for b in batch])
    attention_mask = torch.stack([b['attention_mask'] for b in batch])
    labels = torch.stack([b['labels'] for b in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}
train_data = [preprocess(ex) for ex in train_rows]
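
The size of the LoRA adapters follows from simple arithmetic. The sketch below assumes the usual `t5-base` shapes (12 encoder blocks, 12 decoder blocks, 768-wide `q` and `v` projections); these values are taken as assumptions and are not stated in the notebook. The helper name is hypothetical.

```python
def lora_trainable_params(d_in, d_out, r, n_modules):
    """Each adapted linear layer gains A (r x d_in) and B (d_out x r)."""
    return n_modules * r * (d_in + d_out)

# Assumed t5-base shapes (not stated in the notebook)
D_MODEL = 768
ENCODER_BLOCKS = 12
DECODER_BLOCKS = 12

# Encoder self-attention, decoder self-attention, decoder cross-attention
attention_blocks = ENCODER_BLOCKS + 2 * DECODER_BLOCKS
n_modules = attention_blocks * 2  # 'q' and 'v' in every attention block

total = lora_trainable_params(D_MODEL, D_MODEL, r=8, n_modules=n_modules)
scaling = 32 / 8  # lora_alpha / r, the factor applied to the adapter output
print(total, scaling)
```

Under these assumptions the adapters add 884,736 trainable parameters. If `peft` is available, `t5_model.print_trainable_parameters()` reports the actual count for comparison.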



In [10]:
training_args = TrainingArguments(
    output_dir="./t5-lora-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=8,
    logging_steps=5,
    save_strategy="no",
    report_to=[],
    fp16=(device=="cuda"),
    remove_unused_columns=False
)
trainer = Trainer(
    model=t5_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=None,
    data_collator=collate_fn,
)
trainer.train()

No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Step | Training Loss |
|------|---------------|
| 5    | 3.2990        |
| 10   | 3.6225        |
| 15   | 4.3843        |
| 20   | 3.1489        |
| 25   | 2.9596        |
| 30   | 2.9048        |
| 35   | 3.0702        |
| 40   | 2.6240        |
| 45   | 2.8332        |
| 50   | 2.6974        |

TrainOutput(global_step=320, training_loss=1.1995046004652976, metrics={'train_runtime': 124.477, 'train_samples_per_second': 10.283, 'train_steps_per_second': 2.571, 'total_flos': 391472511713280.0, 'train_loss': 1.1995046004652976, 'epoch': 8.0})
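
The reported `global_step=320` can be cross-checked against the training arguments. The sketch below is a simplified model of how optimizer steps are counted (no gradient accumulation); `total_steps` is a hypothetical helper, and the device count is an inference from the log, not something the notebook states.

```python
import math

def total_steps(n_samples, per_device_batch, n_devices, epochs):
    """Optimizer steps for a full run, assuming no gradient accumulation."""
    effective_batch = per_device_batch * n_devices
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    return steps_per_epoch * epochs

one_device = total_steps(160, per_device_batch=2, n_devices=1, epochs=8)
two_devices = total_steps(160, per_device_batch=2, n_devices=2, epochs=8)
print(one_device, two_devices)
```

With 160 training rows, a per-device batch of 2, and 8 epochs, a single device would give 640 steps, while two devices give 320. The logged value is therefore consistent with an effective batch size of 4, for example from two GPUs.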

## 8. Inference After Fine-tuning
Evaluate T5 on the test set after LoRA fine-tuning.

In [11]:
print("Evaluating after fine-tuning...")
post_ft_metrics = generate_and_evaluate(test_rows, t5_model, t5_tokenizer)
print("After fine-tuning:", post_ft_metrics)



Evaluating after fine-tuning...




After fine-tuning: {'bleu': 0.7598732797841712, 'rougeL': 0.7913544056957874}


## 9. Save and Reload the Fine-tuned Model
Save the fine-tuned T5 model and tokenizer for future use.

In [12]:
save_dir = "./t5-lora-finetuned"
t5_model.save_pretrained(save_dir)
t5_tokenizer.save_pretrained(save_dir)
# To load later (save_dir holds only the LoRA adapter, so load the base model first):
# from transformers import T5ForConditionalGeneration, T5Tokenizer
# from peft import PeftModel
# base_model = T5ForConditionalGeneration.from_pretrained('t5-base')
# t5_model = PeftModel.from_pretrained(base_model, save_dir)
# t5_tokenizer = T5Tokenizer.from_pretrained(save_dir)

('./t5-lora-finetuned/tokenizer_config.json',
 './t5-lora-finetuned/special_tokens_map.json',
 './t5-lora-finetuned/spiece.model',
 './t5-lora-finetuned/added_tokens.json')

## 10. Generate a Report from a Test Sample (BLIP2 + Fine-tuned T5)
Use the pipeline to generate a report from a sample in the test set after fine-tuning.

In [32]:
def generate_report_from_test_sample(test_sample, blip2_processor, blip2_model, t5_tokenizer, t5_model):
    image = test_sample['image']
    findings = test_sample.get('findings', '')
    impression = test_sample.get('impression', '')
    if isinstance(image, str):
        image = Image.open(image).convert('RGB')
    blip2_inputs = blip2_processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        generated_ids = blip2_model.generate(**blip2_inputs, max_new_tokens=64)
        caption = blip2_processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    t5_input = f"Image Caption: {caption}. Findings: {findings}. Impression: {impression}."
    t5_enc = t5_tokenizer(t5_input, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = t5_model.generate(input_ids=t5_enc.input_ids, attention_mask=t5_enc.attention_mask, max_new_tokens=512)
        report = t5_tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return report

# Example: Generate report for the first sample in the test set
sample = test_rows[0]
generated_report = generate_report_from_test_sample(sample, blip2_processor, blip2_model, t5_tokenizer, t5_model)
print('Generated Report:', generated_report)
print('Ground Truth Findings:', sample.get('findings', ''))
print('Ground Truth Impression:', sample.get('impression', ''))



Generated Report: Findings: A small residual located air collection may be present at left base at the site of chest tube removal. A more vertically oriented lucent line at the left base is more likely to represent a skinfold rather than a large pneumothorax. Consolidation and small effusion at the right base are unchanged. An endotracheal tube remains in the upper airway. Nasogastric tube remains in the stomach. Thoracic spinal fusion and spacer hardware is stable. . Impression: No large left pneumothorax status post chest tube removal. .
Ground Truth Findings: A small residual loculated air collection may be present at the left base at the site of chest tube removal. A more vertically oriented lucent line at the left base is more likely to represent a skinfold rather than a large pneumothorax. Consolidation and small effusion at the right base are unchanged. A right internal jugular catheter remains at the cavoatrial junction. An endotracheal tube remains in the upper airway. Nasogas

In [29]:
def generate_report_from_test_sample(
    test_sample, blip2_processor, blip2_model, t5_tokenizer, t5_model, enforce_sections=True
):
    image = test_sample['image']
    findings = test_sample.get('findings', '')
    impression = test_sample.get('impression', '')
    # If image is a path, open it
    if isinstance(image, str):
        from PIL import Image
        image = Image.open(image).convert('RGB')
    # BLIP2: Generate caption
    blip2_inputs = blip2_processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        generated_ids = blip2_model.generate(**blip2_inputs, max_new_tokens=32)
        caption = blip2_processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    # Compose a more explicit T5 input prompt
    t5_input = (
        "Generate a radiology report with both Findings and Impression sections. "
        f"Image Caption: {caption}. Findings: {findings}. Impression: {impression}."
    )
    t5_enc = t5_tokenizer(t5_input, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = t5_model.generate(input_ids=t5_enc.input_ids, attention_mask=t5_enc.attention_mask, max_new_tokens=1024)
        report = t5_tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Optionally enforce both sections in the output
    if enforce_sections:
        if "Impression:" not in report:
            report += "\nImpression: [No impression generated.]"
        if "Findings:" not in report:
            report = "Findings: [No findings generated.]\n" + report
    return report

# Example usage:
sample = test_rows[0]
generated_report = generate_report_from_test_sample(sample, blip2_processor, blip2_model, t5_tokenizer, t5_model)
print("Generated Report:", generated_report)
print("Ground Truth Findings:", sample.get('findings', ''))
print("Ground Truth Impression:", sample.get('impression', ''))



Generated Report: Impression: There may be a small loculated air collection at the chest tube site.. . Findings: A more vertically oriented lucent line at the left base is more likely to represent a skinfold rather than a large pneumothorax. Consolidation and small effusion at the right base are unchanged. An endotracheal catheter remains in the upper airway. Nasogastric tube remains in the stomach. Thoracic spinal fusion and spacer hardware is stable.. Image Caption: .
Ground Truth Findings: A small residual loculated air collection may be present at the left base at the site of chest tube removal. A more vertically oriented lucent line at the left base is more likely to represent a skinfold rather than a large pneumothorax. Consolidation and small effusion at the right base are unchanged. A right internal jugular catheter remains at the cavoatrial junction. An endotracheal tube remains in the upper airway. Nasogastric tube remains in the stomach. Thoracic spinal fusion and spacer har
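
The generated report above puts the Impression before the Findings and ends with an empty `Image Caption:` label. One possible way to normalize such output is sketched below with the standard library only. `split_sections` and `format_report` are hypothetical helpers, not part of the notebook, and the parser assumes that section labels appear literally as `Findings:`, `Impression:`, and `Image Caption:`.

```python
import re

SECTION_RE = re.compile(r"(Findings|Impression|Image Caption):")

def split_sections(report):
    """Map each section label to the text that follows it, up to the next label."""
    sections = {}
    matches = list(SECTION_RE.finditer(report))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        text = report[m.end():end].strip(" .\n")
        # Keep the first non-empty occurrence of each label
        if text and m.group(1) not in sections:
            sections[m.group(1)] = text
    return sections

def format_report(report):
    """Emit Findings then Impression, with placeholders for missing sections."""
    s = split_sections(report)
    findings = (s["Findings"] + ".") if "Findings" in s else "[No findings generated.]"
    impression = (s["Impression"] + ".") if "Impression" in s else "[No impression generated.]"
    return f"Findings: {findings}\nImpression: {impression}"

raw = "Impression: Small air collection.. . Findings: Hardware is stable.. Image Caption: ."
print(format_report(raw))
```

This only reorders and trims text that the model already produced. It does not detect placeholder output such as a literal `[impression]`, so it is not a substitute for checking the generated content.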

In [18]:
def generate_structured_report_from_sample(
    test_sample, blip2_processor, blip2_model, t5_tokenizer, t5_model, enforce_sections=True
):
    image = test_sample['image']
    findings = test_sample.get('findings', '')
    impression = test_sample.get('impression', '')
    if isinstance(image, str):
        from PIL import Image
        image = Image.open(image).convert('RGB')
    # BLIP2: Generate caption
    blip2_inputs = blip2_processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        generated_ids = blip2_model.generate(**blip2_inputs, max_new_tokens=32)
        caption = blip2_processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    # Compose a highly structured prompt
    t5_input = (
        "You are a radiologist. Write a detailed, structured radiology report for the following chest X-ray. "
        "The report must include both a 'Findings:' section and an 'Impression:' section, each with as much relevant detail as possible. "
        f"Image Caption: {caption}\n"
        f"Findings: {findings}\n"
        f"Impression: {impression}\n"
        "Format:\nRadiology Report\n\nFindings:\n[findings]\n\nImpression:\n[impression]"
    )
    t5_enc = t5_tokenizer(t5_input, return_tensors="pt").to(device)
    with torch.no_grad():
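        # Note: temperature and top_p only take effect when do_sample=True is passed;
        # without it, generate() ignores them and decodes deterministically
        output_ids = t5_model.generate(
            input_ids=t5_enc.input_ids,
            attention_mask=t5_enc.attention_mask,
            max_new_tokens=256,
            temperature=0.7,
            top_p=0.95
        )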
        report = t5_tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Optionally enforce both sections in the output
    if enforce_sections:
        if "Impression:" not in report:
            report += "\nImpression: [No impression generated.]"
        if "Findings:" not in report:
            report = "Findings: [No findings generated.]\n" + report
    print("BLIP2 Caption:", caption)
    print("\nGenerated Radiology Report:\n", report)
    print("\nGround Truth Findings:", findings)
    print("Ground Truth Impression:", impression)
    return report

# Example usage:
sample = test_rows[0]
_ = generate_structured_report_from_sample(
    sample, blip2_processor, blip2_model, t5_tokenizer, t5_model
)



BLIP2 Caption: a chest x - ray shows the lungs and heart


Generated Radiology Report:
 lungs and heart. Image Caption: a chest x ray shows the lungs and heart. Findings: [findings] Impression: [impression] Impression: [impression] Impression: [impression] Impression: [impression] Impression: [impression] Impression: [impression] Impression: [impression] Impression: [impression] Impression: [impression] Impression: No large left pneumothorax status post chest tube removal. a small loculated air collection may be present at the left base

Ground Truth Findings: A small residual loculated air collection may be present at the left base at the site of chest tube removal. A more vertically oriented lucent line at the left base is more likely to represent a skinfold rather than a large pneumothorax. Consolidation and small effusion at the right base are unchanged. A right internal jugular catheter remains at the cavoatrial junction. An endotracheal tube remains in the upper airway. Nasogastr