CXR Training Tool

An interactive chest X-ray (CXR) teaching application for medical imaging education. It gives medical students and healthcare professionals an AI-powered platform for practicing chest X-ray interpretation with immediate feedback.

Features

  • Random Case Selection: Practice with ~4,300 real chest X-ray images from the NIH ChestX-ray14 dataset
  • Similar Case Hints: View similar cases before submitting to aid learning
  • AI Tutor Feedback: Get educational explanations powered by Azure OpenAI (GPT-5.1)
  • Structured Radiology Reports: View CXRReportGen findings with bounding box visualizations
  • Similarity Search: Cases indexed using MedImageInsight embeddings for intelligent recommendations
  • Educational Focus: Designed for learning, not clinical use

Dataset

Source Data

The chest X-ray images are sourced from the NIH ChestX-ray14 dataset available on Kaggle:

  • Original Dataset: NIH Chest X-rays on Kaggle
  • Full Dataset Size: 112,120 frontal-view X-ray images from 30,805 unique patients

Our Dataset Subset

For this educational application, we created a curated subset:

  • Selected Images: ~4,300 images randomly sampled from the full dataset
  • Selection Strategy: Stratified sampling to ensure adequate representation of each medical condition
  • Conditions Covered: All 14 pathology labels from the NIH dataset:
    • Atelectasis, Cardiomegaly, Consolidation, Edema, Effusion, Emphysema
    • Fibrosis, Hernia, Infiltration, Mass, Nodule, Pleural Thickening
    • Pneumonia, Pneumothorax
  • Normal Cases: Included to provide balanced learning scenarios

Data Processing Pipeline

  1. Image Selection & Upload (one-time manual process):

    • Downloaded ~4,300 images from Kaggle dataset
    • Ensured balanced representation across all 14 conditions
    • Uploaded images to Azure Blob Storage for centralized access
    • Created index CSV (nih_demo_index.csv) with metadata and labels
  2. AI Model Processing (scripts/build_index.py):

    • Generated MedImageInsight embeddings (1024-dimensional vectors) for each image
    • Generated CXRReportGen structured reports with findings and bounding boxes
    • Built Nearest Neighbor index using scikit-learn for similarity search
    • Saved all artifacts to data/ directory for application use
  3. Deployment (scripts/download_data.py):

    • Pre-processed data artifacts are stored in Azure Blob Storage
    • Downloaded automatically during application startup
    • No runtime AI model calls needed (all pre-computed; a runtime lookup sketch follows this list)
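
For illustration, here is a minimal sketch of how a hint request can be answered at runtime from the precomputed artifacts alone. The file names match the data/ listing below; the lookup logic itself is an assumption, not the exact code in app/services/cases.py.

import joblib
import numpy as np
import pandas as pd

# Precomputed artifacts produced by scripts/build_index.py.
embeddings = np.load("data/embeddings_complete.npy")          # (~4,300, 1024) float array
metadata = pd.read_parquet("data/metadata_complete.parquet")  # row-aligned with embeddings
nn_index = joblib.load("data/nn_index_complete.pkl")          # scikit-learn NearestNeighbors

def similar_cases(row_idx: int, k: int = 6) -> pd.DataFrame:
    """Return metadata for the k nearest neighbors of one case, excluding itself."""
    query = embeddings[row_idx].reshape(1, -1)
    # Request k + 1 neighbors because the closest match is the query image itself.
    _, indices = nn_index.kneighbors(query, n_neighbors=k + 1)
    neighbors = [i for i in indices[0] if i != row_idx][:k]
    return metadata.iloc[neighbors]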

Architecture

Technology Stack

  • Backend: FastAPI (Python 3.10+)
  • Frontend: Vanilla JavaScript, HTML5, CSS3
  • AI Services: Azure OpenAI, MedImageInsight, CXRReportGen
  • Data Storage: Azure Blob Storage
  • Hosting: Azure App Service (Linux)

Project Structure

cxr-training-tool/
├── app/
│   ├── main.py                 # FastAPI application & API endpoints
│   ├── models.py               # Pydantic request/response models
│   ├── core/
│   │   └── config.py          # Configuration & constants
│   └── services/
│       ├── cases.py           # Case data access layer
│       └── chat.py            # AI tutor integration
├── scripts/
│   ├── download_data.py       # Download data from Azure Blob Storage
│   ├── data_utils.py          # Dataset wrapper utilities
│   └── build_index.py         # Build embeddings & NN index
├── static/
│   ├── index.html             # Frontend UI
│   └── app.js                 # Frontend logic
├── data/                      # Data files (not in git)
│   ├── embeddings_complete.npy
│   ├── metadata_complete.parquet
│   ├── cxr_reports_complete.json
│   └── nn_index_complete.pkl
├── startup.sh                 # Azure App Service startup script
├── requirements.txt           # Python dependencies
└── README.md

Setup & Installation

Prerequisites

  • Python 3.10 or higher
  • Azure account with access to:
    • Azure Blob Storage (for CXR images and data)
    • Azure OpenAI (for GPT-5.1)
    • MedImageInsight endpoint
    • CXRReportGen endpoint

Local Development Setup

  1. Clone the repository

    git clone <repository-url>
    cd cxr-training-tool
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure environment variables

    Create a .env file in the project root:

    # Azure Storage
    AZURE_STORAGE_ACCOUNT_URL=https://<account>.blob.core.windows.net
    AZURE_STORAGE_CONTAINER=nih-cxr
    
    # MedImageInsight (optional - for building embeddings)
    MEDIMAGEINSIGHT_ENDPOINT=https://<endpoint>.cognitiveservices.azure.com/
    MEDIMAGEINSIGHT_KEY=<your-key>
    
    # CXRReportGen (optional - for generating reports)
    CXRREPORTGEN_ENDPOINT=https://<endpoint>.cognitiveservices.azure.com/
    CXRREPORTGEN_KEY=<your-key>
    
    # Chat Model (Azure OpenAI)
    CHAT_ENDPOINT=https://<resource>.openai.azure.com/openai/deployments/<deployment>/
    CHAT_API_KEY=<your-api-key>
    CHAT_MODEL_NAME=gpt-5.1
    
    # Data Directory (optional - defaults to 'data')
    DATA_DIR=data
  5. Download data files

    python scripts/download_data.py
  6. Run the application

    uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
  7. Open in browser

    http://localhost:8000
    

Configuration

Application Constants

Configure in app/core/config.py:

  • DEFAULT_SIMILAR_CASES = 6 - Number of similar cases shown by default
  • MAX_SIMILAR_CASES = 20 - Maximum similar cases allowed per request
  • MAX_EXPLANATION_TOKENS = 300 - Max tokens for AI explanations
  • MAX_REASONING_LENGTH = 2000 - Max length for student reasoning text
  • IMAGE_CACHE_MAX_AGE = 3600 - Browser cache duration for images (seconds)
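
As a hedged sketch of how app/core/config.py might look: the constant names and defaults come from the list above, while the os.environ loading pattern is an assumption about the implementation.

import os

# Tunable application constants (values as documented above).
DEFAULT_SIMILAR_CASES = 6      # similar cases shown by default
MAX_SIMILAR_CASES = 20         # upper bound per hint request
MAX_EXPLANATION_TOKENS = 300   # cap on AI explanation length
MAX_REASONING_LENGTH = 2000    # cap on student reasoning text
IMAGE_CACHE_MAX_AGE = 3600     # browser cache for images, in seconds

# Environment-driven settings (see the table below).
DATA_DIR = os.environ.get("DATA_DIR", "data")
AZURE_STORAGE_ACCOUNT_URL = os.environ["AZURE_STORAGE_ACCOUNT_URL"]
AZURE_STORAGE_CONTAINER = os.environ["AZURE_STORAGE_CONTAINER"]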

Environment Variables

Variable                    Required   Description
AZURE_STORAGE_ACCOUNT_URL   Yes        Azure Blob Storage account URL
AZURE_STORAGE_CONTAINER     Yes        Container name for CXR images
CHAT_ENDPOINT               Yes        Azure OpenAI endpoint URL
CHAT_API_KEY                Yes        Azure OpenAI API key
CHAT_MODEL_NAME             Yes        Chat model name (default: gpt-5.1)
DATA_DIR                    No         Data directory path (default: data)
MEDIMAGEINSIGHT_ENDPOINT    No         Only needed for building embeddings
MEDIMAGEINSIGHT_KEY         No         Only needed for building embeddings
CXRREPORTGEN_ENDPOINT       No         Only needed for generating reports
CXRREPORTGEN_KEY            No         Only needed for generating reports

API Endpoints

Health Check

  • GET / - Redirects to frontend
  • GET /api/health - Health check with dataset stats

Case Management

  • GET /api/case/random - Get a random CXR case
  • GET /api/case/{image_id}/hint?k=6 - Get similar cases (hint mode)
  • POST /api/case/{image_id}/submit - Submit diagnosis and get feedback

Image Proxy

  • GET /api/image/{image_id} - Proxy endpoint for CXR images
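
As a quick smoke test, a hedged example of exercising these endpoints with the requests library. The JSON field names (image_id, diagnosis, reasoning) are assumptions; check the Pydantic schemas at /docs for the actual shapes.

import requests

BASE = "http://localhost:8000"

# Pull a random case; the response schema is defined in app/models.py.
case = requests.get(f"{BASE}/api/case/random").json()
image_id = case["image_id"]  # assumed field name -- verify against /docs

# Fetch up to 6 similar cases as a hint.
hints = requests.get(f"{BASE}/api/case/{image_id}/hint", params={"k": 6}).json()

# Submit a diagnosis with free-text reasoning and get AI tutor feedback.
feedback = requests.post(
    f"{BASE}/api/case/{image_id}/submit",
    json={"diagnosis": "Cardiomegaly", "reasoning": "Enlarged cardiac silhouette."},
).json()
print(feedback)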

API Documentation

  • GET /docs - Interactive Swagger UI documentation
  • GET /redoc - ReDoc documentation

Testing

Run unit tests:

# Test services
python tests/dev/test_cases_service.py
python tests/dev/test_chat_service.py

# Test API endpoints (requires running server)
python tests/dev/test_api_endpoints.py

Data Pipeline & Scripts

The application uses pre-generated data files. Here's how the data processing works:

Available Scripts

1. scripts/build_index.py - Main Data Processing Pipeline

Purpose: Process raw images and generate all AI-powered artifacts

What it does:

  • Reads data/nih_demo_index.csv (index of selected images with metadata)
  • Downloads each image from Azure Blob Storage
  • Calls MedImageInsight API to generate 1024-dimensional embeddings
  • Calls CXRReportGen API to generate structured radiology reports with bounding boxes
  • Builds scikit-learn NearestNeighbors index for fast similarity search
  • Saves results periodically (every 50 images) to prevent data loss
  • Logs progress to logs/build_index_<timestamp>.log

Output files (saved to data/):

  • embeddings_complete.npy - NumPy array of image embeddings (~4,300 × 1024)
  • metadata_complete.parquet - Parquet file with image metadata and labels
  • cxr_reports_complete.json - JSON with structured radiology reports
  • nn_index_complete.pkl - Joblib-serialized nearest neighbor search index

Usage:

python scripts/build_index.py

Requirements:

  • MEDIMAGEINSIGHT_ENDPOINT and MEDIMAGEINSIGHT_KEY environment variables
  • CXRREPORTGEN_ENDPOINT and CXRREPORTGEN_KEY environment variables
  • data/nih_demo_index.csv input file
  • Images already uploaded to Azure Blob Storage

Processing time: ~2-4 hours for 4,300 images (depending on API response times)
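
The indexing step at the end of this pipeline is standard scikit-learn. A minimal sketch, assuming cosine distance over the MedImageInsight embeddings (the metric choice is an assumption, not confirmed by the script):

import joblib
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stack of per-image MedImageInsight vectors, one row per image.
embeddings = np.load("data/embeddings_complete.npy")

# Fit a brute-force cosine index; at ~4,300 vectors this is fast and exact.
nn_index = NearestNeighbors(n_neighbors=20, metric="cosine", algorithm="brute")
nn_index.fit(embeddings)

# Serialize with joblib, matching the nn_index_complete.pkl artifact above.
joblib.dump(nn_index, "data/nn_index_complete.pkl")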

2. scripts/download_data.py - Deployment Data Downloader

Purpose: Download pre-processed data artifacts during application startup

What it does:

  • Downloads the 4 output files from Azure Blob Storage
  • Skips files that already exist locally
  • Used by startup.sh during Azure App Service deployment
  • Much faster than regenerating (~2-5 minutes vs hours)

Usage:

python scripts/download_data.py
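
A hedged sketch of the skip-if-present download logic, using the azure-storage-blob SDK. The app-data/ blob prefix matches the upload command shown later; the credential flow (DefaultAzureCredential) is an assumption about the real script.

import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

ARTIFACTS = [
    "embeddings_complete.npy",
    "metadata_complete.parquet",
    "cxr_reports_complete.json",
    "nn_index_complete.pkl",
]

service = BlobServiceClient(
    account_url=os.environ["AZURE_STORAGE_ACCOUNT_URL"],
    credential=DefaultAzureCredential(),
)
container = service.get_container_client(os.environ["AZURE_STORAGE_CONTAINER"])

data_dir = os.environ.get("DATA_DIR", "data")
os.makedirs(data_dir, exist_ok=True)
for name in ARTIFACTS:
    local_path = os.path.join(data_dir, name)
    if os.path.exists(local_path):
        continue  # skip files that already exist locally
    with open(local_path, "wb") as f:
        f.write(container.download_blob(f"app-data/{name}").readall())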

3. scripts/data_utils.py - Data Access Utilities

Purpose: Provides a clean Python interface to load and query the processed data

What it provides:

  • CXRDataset class - unified interface to all data files
  • Validates data alignment (embeddings, metadata, reports all match)
  • Methods: get_metadata(), get_embedding(), get_report(), find_similar()
  • Used by app/services/cases.py in the main application
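
A hedged usage example built from the method names listed above; the constructor and exact signatures are assumptions, so consult scripts/data_utils.py for the real interface.

from scripts.data_utils import CXRDataset

# Load and cross-validate all four artifacts from the data/ directory
# (constructor argument is assumed to be the data directory path).
dataset = CXRDataset("data")

meta = dataset.get_metadata("00000001_000.png")     # labels and Azure path
vector = dataset.get_embedding("00000001_000.png")  # 1024-dim MedImageInsight vector
report = dataset.get_report("00000001_000.png")     # CXRReportGen findings
neighbors = dataset.find_similar("00000001_000.png", k=6)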

Rebuilding from Scratch

If you need to regenerate the data pipeline:

  1. Prepare Input CSV (data/nih_demo_index.csv):

    Image_ID,Image_Filename,Azure_Path,Binary_Label,Main_Labels,Multi_Class,Condition_Count,Source
    1,00000001_000.png,images/00000001_000.png,Abnormal,Cardiomegaly,Cardiomegaly,1,NIH
    2,00000002_001.png,images/00000002_001.png,Normal,,No Finding,0,NIH
    ...
    
  2. Upload Images to Azure Blob Storage:

    • Container: As specified in AZURE_STORAGE_CONTAINER
    • Path: Match the Azure_Path column in your CSV
  3. Configure API Keys:

    export MEDIMAGEINSIGHT_ENDPOINT="https://<resource>.cognitiveservices.azure.com/"
    export MEDIMAGEINSIGHT_KEY="<your-key>"
    export CXRREPORTGEN_ENDPOINT="https://<resource>.cognitiveservices.azure.com/"
    export CXRREPORTGEN_KEY="<your-key>"
  4. Run Pipeline:

    python scripts/build_index.py
  5. Upload Results to Blob Storage (for deployment):

    # Upload the 4 generated files to Azure Blob Storage under app-data/ prefix
    az storage blob upload --account-name <account> --container-name <container> \
      --name app-data/embeddings_complete.npy --file data/embeddings_complete.npy
    # ... repeat for other files
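
Equivalently, a small Python loop mirroring the download script can push all four artifacts at once (a sketch; the app-data/ prefix follows the az command above):

import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url=os.environ["AZURE_STORAGE_ACCOUNT_URL"],
    credential=DefaultAzureCredential(),
)
container = service.get_container_client(os.environ["AZURE_STORAGE_CONTAINER"])

for name in [
    "embeddings_complete.npy",
    "metadata_complete.parquet",
    "cxr_reports_complete.json",
    "nn_index_complete.pkl",
]:
    with open(os.path.join("data", name), "rb") as f:
        # overwrite=True so re-runs replace stale artifacts
        container.upload_blob(name=f"app-data/{name}", data=f, overwrite=True)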

What's NOT in the Scripts

The following one-time manual processes were performed, but their scripts are not included in this repository:

Image Selection Script:

  • Downloaded ~4,300 images from the Kaggle NIH ChestX-ray14 dataset
  • Implemented stratified sampling to ensure balanced condition representation (one possible approach is sketched below)
  • Uploaded selected images to Azure Blob Storage
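
For anyone reproducing the selection step, a hedged sketch of one way to stratify the sample with pandas. It assumes the Data_Entry_2017.csv label file that ships with the Kaggle dataset; the per-condition quota N is illustrative, not the value actually used.

import pandas as pd

# Label file distributed with the Kaggle NIH ChestX-ray14 dataset (assumed input).
labels = pd.read_csv("Data_Entry_2017.csv")

# "Finding Labels" holds pipe-separated conditions; explode to one row per label.
per_label = labels.assign(label=labels["Finding Labels"].str.split("|")).explode("label")

# Sample up to N images per condition, then de-duplicate, since a multi-label
# image is drawn once for each condition it carries.
N = 300  # illustrative quota per condition
sampled = (
    per_label.groupby("label", group_keys=False)
    .apply(lambda g: g.sample(min(N, len(g)), random_state=42))
)
selected = sampled["Image Index"].drop_duplicates()
selected.to_csv("selected_images.csv", index=False)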

Index CSV Creation:

  • Parsed original NIH dataset metadata
  • Matched image filenames to labels
  • Created nih_demo_index.csv with aligned metadata

Educational Use Only

Important Disclaimer

This application is designed for educational purposes only. It should not be used for:

  • Clinical diagnosis or treatment decisions
  • Patient care or medical advice
  • Any situation requiring FDA-approved medical devices
  • Processing actual protected health information (PHI)

The AI-generated explanations and reports are for teaching pattern recognition in radiology, not clinical decision support.

Acknowledgments

  • NIH ChestX-ray14 Dataset: Chest X-ray images from NIH Clinical Center
  • MedImageInsight: Medical image embedding model
  • CXRReportGen: Structured chest X-ray report generation
  • Azure OpenAI: GPT-5.1 for educational feedback
