# Microsoft's Maira-2 LLM for Radiology Report Generation
* Notebook by Adam Lang
* Date: 4/17/2025

# Background
* MAIRA-2 is an open-source multimodal transformer model developed by Microsoft Research Health Futures.
* It is published on Hugging Face here: https://huggingface.co/microsoft/maira-2
* This model is not for commercial or clinical use but for experimentation and research only.
* Original paper: https://arxiv.org/abs/2406.04449


# Training
* **This is quoted directly from the Hugging Face repo:**
  * *The model was trained on chest X-ray report datasets from Spain (translated from the original Spanish to English) and the USA. Reporting styles, patient demographics and disease prevalence, and image acquisition protocols can vary across health systems and regions. These factors will impact the generalisability of the model.*


# Usage
* **This is reported from the huggingface repo:**

1. MAIRA-2 takes a frontal chest X-ray, and any of the following:
  * A lateral view from the current study
  * A frontal view from the prior study, with accompanying prior report
  * The indication for the current study
  * The technique and comparison sections for the current study

2. MAIRA-2 can generate the findings section of the current study, in one of two forms:
  * **Narrative text**
    * This is without any image annotations (this is the typical report generation scenario).

  * **Grounded report**
    * All described findings are accompanied by zero or more bounding boxes indicating their location on the current frontal image.

3. MAIRA-2 can also perform **phrase grounding.**
  * In this case, it must also be provided with an input phrase.
  * It will then repeat the phrase and generate a bounding box localising the finding described in the phrase.
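
For the grounded report and phrase grounding modes described above, each finding comes with zero or more bounding boxes. The following is a minimal sketch of handling that kind of output. It assumes (this is not stated above) that each box is an `(x_min, y_min, x_max, y_max)` tuple normalised to [0, 1] and that a finding without a box is marked with `None`. The names `Box`, `GroundedFinding` and `boxes_to_pixels` are hypothetical, and the example values are made up rather than real model output.

```python
from typing import Optional

# Assumed layout: (x_min, y_min, x_max, y_max), each normalised to [0, 1].
Box = tuple[float, float, float, float]
# Assumed pairing of a finding sentence with its boxes (None means no box).
GroundedFinding = tuple[str, Optional[list[Box]]]


def boxes_to_pixels(
    boxes: Optional[list[Box]], width: int, height: int
) -> list[tuple[int, int, int, int]]:
    """Scale normalised boxes to integer pixel coordinates for a width x height image."""
    if not boxes:
        return []
    return [
        (round(x0 * width), round(y0 * height), round(x1 * width), round(y1 * height))
        for x0, y0, x1, y1 in boxes
    ]


# Made-up example values, not real model output.
example: list[GroundedFinding] = [
    ("Large right pleural effusion.", [(0.25, 0.5, 0.75, 1.0)]),
    ("No pneumothorax.", None),
]
pixel_boxes = [boxes_to_pixels(boxes, 512, 624) for _, boxes in example]
print(pixel_boxes)
```

If the processor crops or resizes the image before inference, the boxes may need a further adjustment back to the original image, so check the model card before drawing them.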


# Critical Analysis
* The original paper gives a thorough evaluation of MAIRA-2 and reports that it outperforms other state-of-the-art models on the metrics evaluated.
* However, the paper's authors acknowledge several limitations and areas for further research:

1. **Dataset Bias**
  * Maira-2 was trained and evaluated on datasets that may not fully represent the diversity of real-world radiology cases.
  * This could lead to biases in the outputs.

2. **Lack of Clinical Validation**
  * The authors report that MAIRA-2 outperforms other models on several metrics. However, they **do not provide any clinical validation to ensure the reports are actually useful and accurate from a medical perspective.**

3. **Clinical Interpretability**
  * As with most neural network models, the inner workings of MAIRA-2 are not fully interpretable.
  * This makes it difficult for medical professionals to understand and trust the model's decisions.

4. **RadFact -- Machine Learning Explainability**
  * RadFact was developed by the authors to evaluate model-generated radiology reports against a ground-truth report. It also enables evaluation of grounding annotations.
  * This framework relies on the logical inference abilities of LLMs.
  * The full suite of metrics is on their github: https://github.com/microsoft/RadFact
  * The authors stated they used `Llama3-70B-Instruct` (https://huggingface.co/meta-llama/Meta-Llama3-70B-Instruct) for entailment verification with ten in-context examples.

5. The paper does not address the potential challenges of deploying an automated radiology report generation system in a real-world clinical setting, such as data privacy concerns, integration with existing workflows, and the need for ongoing monitoring and maintenance.

6. **Data Bias**
  * The model was trained and tested on MIMIC-CXR data, which is dominated by reports from the intensive care unit (ICU) and inpatient settings and may not reflect every real-world scenario.
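
To make the entailment-based evaluation in item 4 more concrete, here is a toy sketch of the precision/recall idea: a generated sentence counts toward precision if the reference report entails it, and a reference sentence counts toward recall if the generated report entails it. This is not RadFact's implementation. It assumes reports are already split into sentences, and `toy_entails` is a hypothetical stand-in (case-insensitive exact match) for the LLM entailment check.

```python
from typing import Callable

EntailFn = Callable[[list[str], str], bool]


def toy_entails(premises: list[str], hypothesis: str) -> bool:
    """Hypothetical stand-in for an LLM judge: exact match, ignoring case."""
    return hypothesis.lower() in {p.lower() for p in premises}


def logical_precision_recall(
    generated: list[str], reference: list[str], entails: EntailFn = toy_entails
) -> tuple[float, float]:
    """Return (precision, recall) over sentences, given an entailment function."""
    # Precision: share of generated sentences supported by the reference report.
    precision = (
        sum(entails(reference, s) for s in generated) / len(generated)
        if generated
        else 0.0
    )
    # Recall: share of reference sentences supported by the generated report.
    recall = (
        sum(entails(generated, s) for s in reference) / len(reference)
        if reference
        else 0.0
    )
    return precision, recall


generated_report = ["No pneumothorax.", "Large right pleural effusion."]
reference_report = [
    "Large right pleural effusion.",
    "No pneumothorax.",
    "Heart size is normal.",
]
precision, recall = logical_precision_recall(generated_report, reference_report)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A real entailment judge would also accept paraphrases, which exact matching cannot do, so the numbers from this toy version only illustrate the bookkeeping.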



# Install Dependencies

In [1]:
%%capture
!pip install git+https://github.com/huggingface/transformers.git@88d960937c81a32bfb63356a2e8ecf7999619681 gradio

# Import Libraries

In [2]:
## imports
from transformers import AutoModelForCausalLM, AutoProcessor
from pathlib import Path
import torch

# Hugging Face Token Login

In [4]:
from huggingface_hub import notebook_login

notebook_login()

In [5]:
## load model
model = AutoModelForCausalLM.from_pretrained("microsoft/maira-2", trust_remote_code=True)

## load processor
processor = AutoProcessor.from_pretrained("microsoft/maira-2", trust_remote_code=True)
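
Before loading, it can help to estimate how much memory the weights need. The sketch below adds up the six shard sizes reported by the download log of the original run. The parameter estimate assumes the weights are stored as float32 (4 bytes per parameter), which is an assumption and not something the log states.

```python
# Shard sizes in GB as reported by the download log of the original run.
shard_sizes_gb = [4.96, 4.86, 4.86, 4.86, 4.86, 3.14]

total_gb = sum(shard_sizes_gb)

# Assumption: weights are stored as float32, i.e. 4 bytes per parameter.
approx_params_billions = total_gb / 4

# Casting the weights to a 16-bit dtype would roughly halve the weight memory.
half_precision_gb = total_gb / 2

print(f"checkpoint size:      {total_gb:.2f} GB")
print(f"approx. parameters:   {approx_params_billions:.1f} B (if float32)")
print(f"16-bit weight memory: {half_precision_gb:.2f} GB")
```

If memory is tight, `from_pretrained` accepts a `torch_dtype` argument (for example `torch.float16`). Whether reduced precision changes report quality is not evaluated in this notebook.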

# Get Sample Data
* These are open source images from the National Library of Medicine: https://openi.nlm.nih.gov/
* Two images will be fed to the model:
  1. Frontal chest X-ray
  2. Lateral chest X-ray

In [6]:
## imports
import requests
from PIL import Image


## get open source CXR data from National Library of Medicine
def get_sample_data() -> dict[str, Image.Image | str]:
    """
    Download chest X-rays from IU-Xray, which we didn't train MAIRA-2 on. License is CC.
    We modified this function from the Rad-DINO repository on Huggingface.
    """
    frontal_image_url = "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-1001.png"
    lateral_image_url = "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-2001.png"

    def download_and_open(url: str) -> Image.Image:
        ## send url request and open the streamed response as an image
        response = requests.get(url, headers={"User-Agent": "MAIRA-2"}, stream=True)
        return Image.open(response.raw)

    frontal_image = download_and_open(frontal_image_url)
    lateral_image = download_and_open(lateral_image_url)

    ## setup dict
    sample_data = {
        "frontal": frontal_image,
        "lateral": lateral_image,
        "indication": "Dyspnea.",
        "comparison": "None.",
        "technique": "PA and lateral views of the chest.",
        "phrase": "Pleural effusion."  # For the phrase grounding example. This patient has pleural effusion.
    }
    return sample_data

## run function to get data
sample_data = get_sample_data()
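
`download_and_open` hands the raw response straight to PIL, so a failed request (for example an HTML error page) only surfaces as a confusing image error. One possible guard, sketched below with the standard library only, checks the HTTP status and the 8-byte PNG signature before decoding. The helper name `check_png_response` is hypothetical, and the check assumes the images are PNG files, as the sample URLs suggest.

```python
# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_png_response(status_code: int, body: bytes) -> bytes:
    """Return the body unchanged if it looks like a successfully downloaded PNG."""
    if status_code != 200:
        raise ValueError(f"unexpected HTTP status {status_code}")
    if not body.startswith(PNG_SIGNATURE):
        raise ValueError("response body is not a PNG image")
    return body


# Fake payloads for illustration only; no network access is needed.
ok_body = check_png_response(200, PNG_SIGNATURE + b"fake image data")
print(len(ok_body))
```

In the notebook this could wrap the download as `Image.open(io.BytesIO(check_png_response(response.status_code, response.content)))`, which requires `import io` and reading `response.content` instead of `response.raw`.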

In [7]:
## inspect the sample data (image metadata and text fields)
sample_data

{'frontal': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x624>,
 'lateral': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x624>,
 'indication': 'Dyspnea.',
 'comparison': 'None.',
 'technique': 'PA and lateral views of the chest.',
 'phrase': 'Pleural effusion.'}

## Setup Device Agnostic Code in PyTorch
* Use CUDA if a GPU is available, otherwise fall back to the CPU.

In [8]:
## device agnostic code
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## send model to device
model.to(device)

Maira2ForConditionalGeneration(
  (vision_tower): Dinov2Backbone(
    (embeddings): Dinov2Embeddings(
      (patch_embeddings): Dinov2PatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(14, 14), stride=(14, 14))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): Dinov2Encoder(
      (layer): ModuleList(
        (0-11): 12 x Dinov2Layer(
          (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (attention): Dinov2SdpaAttention(
            (attention): Dinov2SdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): Dinov2SelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inp

In [9]:
## run processor
processed_inputs = processor.format_and_preprocess_reporting_input(
    current_frontal=sample_data["frontal"],
    current_lateral=sample_data["lateral"],
    prior_frontal=None,  # Our example has no prior
    indication=sample_data["indication"],
    technique=sample_data["technique"],
    comparison=sample_data["comparison"],
    prior_report=None,  # Our example has no prior
    return_tensors="pt",
    get_grounding=False,  # For this example we generate a non-grounded report
)


## send processed_inputs to device
processed_inputs = processed_inputs.to(device)

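## generate output token IDs (prompt + completion); no gradients are needed for inference
with torch.inference_mode():
    output_decoding = model.generate(
        **processed_inputs,
        max_new_tokens=450,  # 450 is the budget suggested for grounded reporting; ample for this non-grounded report
        use_cache=True,
    )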
prompt_length = processed_inputs["input_ids"].shape[-1]
decoded_text = processor.decode(output_decoding[0][prompt_length:], skip_special_tokens=True)
decoded_text = decoded_text.lstrip()  # Findings generation completions have a single leading space
prediction = processor.convert_output_to_plaintext_or_grounded_sequence(decoded_text)
print("Parsed prediction:", prediction)

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Parsed prediction: There is a large right pleural effusion with associated right basilar atelectasis. The left lung is clear. No pneumothorax is identified. The cardiomediastinal silhouette and hilar contours are normal. There is no free air under the diaphragm. Surgical clips are noted in the right upper quadrant of the abdomen.
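
The prediction comes back as one block of text. For sentence-level review or comparison against a reference report, it can be split into individual findings. This is a simple sketch using a regular expression. It assumes sentences end with `.`, `!` or `?` followed by whitespace, which would break on abbreviations such as "approx. 2 cm". The function name `split_findings` is hypothetical.

```python
import re


def split_findings(report: str) -> list[str]:
    """Split a findings paragraph into sentences at ., ! or ? followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", report.strip())
    return [s for s in sentences if s]


# The prediction printed by the cell above.
prediction_text = (
    "There is a large right pleural effusion with associated right basilar "
    "atelectasis. The left lung is clear. No pneumothorax is identified. "
    "The cardiomediastinal silhouette and hilar contours are normal. "
    "There is no free air under the diaphragm. Surgical clips are noted in "
    "the right upper quadrant of the abdomen."
)
findings = split_findings(prediction_text)
for i, sentence in enumerate(findings, start=1):
    print(i, sentence)
```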


## Gradio App
* Set up a Gradio application.

In [10]:
## download image function
def download_image(url: str) -> Image.Image:
    """
    Download the image from the given URL and return as a PIL Image.
    """
    response = requests.get(url, headers={"User-Agent": "MAIRA-2"}, stream=True)
    return Image.open(response.raw)

In [11]:
## generate findings report func
def generate_findings(
    frontal_url: str,
    lateral_url: str,
    indication: str,
    comparison: str,
    technique: str
):
    """
    1. Download the frontal & lateral images from the provided URLs.
    2. Format & preprocess the input for the model using `processor`.
    3. Generate the findings from the model.
    4. Return the two images and the generated findings text.
    """
    # 1. Download images
    frontal_image = download_image(frontal_url)
    lateral_image = download_image(lateral_url)

    # 2. Prepare inputs for the model
    processed_inputs = processor.format_and_preprocess_reporting_input(
        current_frontal=frontal_image,
        current_lateral=lateral_image,
        prior_frontal=None,  # Example doesn't use prior images
        indication=indication,
        technique=technique,
        comparison=comparison,
        prior_report=None,   # Example doesn't use prior reports
        return_tensors="pt",
        get_grounding=False, # For a non-grounded report
    )
    processed_inputs = processed_inputs.to(model.device)

    # 3. Generate the findings
    with torch.no_grad():
        output_decoding = model.generate(
            **processed_inputs,
            max_new_tokens=450,
            use_cache=True,
        )

    # Skip the prompt portion for a cleaner result
    prompt_length = processed_inputs["input_ids"].shape[-1]
    decoded_text = processor.decode(output_decoding[0][prompt_length:], skip_special_tokens=True)
    decoded_text = decoded_text.lstrip()

    # Convert the model output into plain text
    prediction = processor.convert_output_to_plaintext_or_grounded_sequence(decoded_text)

    # Return:
    # - frontal/lateral images so they can be displayed in Gradio
    # - the generated findings
    return frontal_image, lateral_image, prediction
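
`generate_findings` downloads whatever URLs it is given, and a Gradio app can be exposed through a public share link. A cautious variation is to validate each URL before fetching it. The sketch below uses only the standard library. The allowlist `ALLOWED_HOSTS` and the helper `is_allowed_image_url` are hypothetical, and restricting downloads to the Open-i host is an assumption made for this demo.

```python
from urllib.parse import urlparse

# Assumption: only the Open-i image host is needed for this demo.
ALLOWED_HOSTS = {"openi.nlm.nih.gov"}


def is_allowed_image_url(url: str) -> bool:
    """Accept only https URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS


candidates = [
    "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-1001.png",
    "file:///etc/passwd",
    "http://169.254.169.254/latest/meta-data/",
]
results = [is_allowed_image_url(url) for url in candidates]
print(results)
```

Calling the check at the top of `generate_findings` and raising an error for rejected URLs would keep the rest of the function unchanged.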

## Code to run Gradio app

In [12]:
import gradio as gr

In [13]:
## gradio application code setup
app_name = "MAIRA-2 CXR Report Generator"
app_description = """
Enter URLs for the frontal and lateral chest X-ray images and relevant metadata.
Click "Generate Findings" to see the automatic radiology report findings.
"""

with gr.Blocks(title=app_name) as demo:
    gr.Markdown(f"## {app_name}")
    gr.Markdown(app_description)

    with gr.Row():
        frontal_url = gr.Textbox(
            label="Frontal Image URL",
            value="https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-1001.png"
        )
        lateral_url = gr.Textbox(
            label="Lateral Image URL",
            value="https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-2001.png"
        )

    indication = gr.Textbox(label="Indication", value="Dyspnea.")
    comparison = gr.Textbox(label="Comparison", value="None.")
    technique = gr.Textbox(label="Technique", value="PA and lateral views of the chest.")

    generate_button = gr.Button("Generate Findings")

    with gr.Row():
        frontal_image_out = gr.Image(label="Frontal Image")
        lateral_image_out = gr.Image(label="Lateral Image")
    result_text_out = gr.Textbox(label="Generated Findings", lines=6)

    ## generate
    generate_button.click(
        fn=generate_findings,
        inputs=[frontal_url, lateral_url, indication, comparison, technique],
        outputs=[frontal_image_out, lateral_image_out, result_text_out]
    )

In [14]:
if __name__ == "__main__":
    demo.launch()

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://7b154e76fab582669c.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
