CXR Training Tool

An interactive chest X-ray (CXR) teaching application for medical imaging education. It gives medical students and healthcare professionals an AI-powered platform for practicing chest X-ray interpretation with immediate feedback.

Features

  • Random Case Selection: Practice with ~4,300 real chest X-ray images from the NIH ChestX-ray14 dataset
  • Similar Case Hints: View similar cases before submitting to aid learning
  • AI Tutor Feedback: Get educational explanations powered by Azure OpenAI (GPT-5.1)
  • Structured Radiology Reports: View CXRReportGen findings with bounding box visualizations
  • Similarity Search: Cases indexed using MedImageInsight embeddings for intelligent recommendations
  • Educational Focus: Designed for learning, not clinical use

Dataset

Source Data

The chest X-ray images are sourced from the NIH ChestX-ray14 dataset available on Kaggle:

  • Original Dataset: NIH Chest X-rays on Kaggle
  • Full Dataset Size: 112,120 frontal-view X-ray images from 30,805 unique patients

Our Dataset Subset

For this educational application, we created a curated subset:

  • Selected Images: ~4,300 images randomly sampled from the full dataset
  • Selection Strategy: Stratified sampling to ensure adequate representation of each medical condition
  • Conditions Covered: All 14 pathology labels from the NIH dataset:
    • Atelectasis, Cardiomegaly, Consolidation, Edema, Effusion, Emphysema
    • Fibrosis, Hernia, Infiltration, Mass, Nodule, Pleural Thickening
    • Pneumonia, Pneumothorax
  • Normal Cases: Included to provide balanced learning scenarios

Data Processing Pipeline

  1. Image Selection & Upload (one-time manual process):

    • Downloaded ~4,300 images from Kaggle dataset
    • Ensured balanced representation across all 14 conditions
    • Uploaded images to Azure Blob Storage for centralized access
    • Created index CSV (nih_demo_index.csv) with metadata and labels
  2. AI Model Processing (scripts/build_index.py):

    • Generated MedImageInsight embeddings (1024-dimensional vectors) for each image
    • Generated CXRReportGen structured reports with findings and bounding boxes
    • Built Nearest Neighbor index using scikit-learn for similarity search
    • Saved all artifacts to data/ directory for application use
  3. Deployment (scripts/download_data.py):

    • Pre-processed data artifacts are stored in Azure Blob Storage
    • Downloaded automatically during application startup
    • No runtime AI model calls needed (all pre-computed; a runtime lookup sketch follows this list)
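
For illustration, here is a minimal sketch of how a hint request can be answered at runtime from the precomputed artifacts alone. The file names match the data/ listing below; the lookup logic itself is an assumption, not the exact code in app/services/cases.py.

import joblib
import numpy as np
import pandas as pd

# Precomputed artifacts produced by scripts/build_index.py.
embeddings = np.load("data/embeddings_complete.npy")          # (~4,300, 1024) float array
metadata = pd.read_parquet("data/metadata_complete.parquet")  # row-aligned with embeddings
nn_index = joblib.load("data/nn_index_complete.pkl")          # scikit-learn NearestNeighbors

def similar_cases(row_idx: int, k: int = 6) -> pd.DataFrame:
    """Return metadata for the k nearest neighbors of one case, excluding itself."""
    query = embeddings[row_idx].reshape(1, -1)
    # Request k + 1 neighbors because the closest match is the query image itself.
    _, indices = nn_index.kneighbors(query, n_neighbors=k + 1)
    neighbors = [i for i in indices[0] if i != row_idx][:k]
    return metadata.iloc[neighbors]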

Architecture

Technology Stack

  • Backend: FastAPI (Python 3.10+)
  • Frontend: Vanilla JavaScript, HTML5, CSS3
  • AI Services: Azure OpenAI, MedImageInsight, CXRReportGen
  • Data Storage: Azure Blob Storage
  • Hosting: Azure App Service (Linux)

Project Structure

cxr-training-tool/
├── app/
│   ├── main.py                 # FastAPI application & API endpoints
│   ├── models.py               # Pydantic request/response models
│   ├── core/
│   │   └── config.py          # Configuration & constants
│   └── services/
│       ├── cases.py           # Case data access layer
│       └── chat.py            # AI tutor integration
├── scripts/
│   ├── download_data.py       # Download data from Azure Blob Storage
│   ├── data_utils.py          # Dataset wrapper utilities
│   └── build_index.py         # Build embeddings & NN index
├── static/
│   ├── index.html             # Frontend UI
│   └── app.js                 # Frontend logic
├── data/                      # Data files (not in git)
│   ├── embeddings_complete.npy
│   ├── metadata_complete.parquet
│   ├── cxr_reports_complete.json
│   └── nn_index_complete.pkl
├── startup.sh                 # Azure App Service startup script
├── requirements.txt           # Python dependencies
└── README.md

Setup & Installation

Prerequisites

  • Python 3.10 or higher
  • Azure account with access to:
    • Azure Blob Storage (for CXR images and data)
    • Azure OpenAI (for GPT-5.1)
    • MedImageInsight endpoint
    • CXRReportGen endpoint

Local Development Setup

  1. Clone the repository

    git clone <repository-url>
    cd cxr-training-tool
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure environment variables

    Create a .env file in the project root:

    # Azure Storage
    AZURE_STORAGE_ACCOUNT_URL=https://<account>.blob.core.windows.net
    AZURE_STORAGE_CONTAINER=nih-cxr
    
    # MedImageInsight (optional - for building embeddings)
    MEDIMAGEINSIGHT_ENDPOINT=https://<endpoint>.cognitiveservices.azure.com/
    MEDIMAGEINSIGHT_KEY=<your-key>
    
    # CXRReportGen (optional - for generating reports)
    CXRREPORTGEN_ENDPOINT=https://<endpoint>.cognitiveservices.azure.com/
    CXRREPORTGEN_KEY=<your-key>
    
    # Chat Model (Azure OpenAI)
    CHAT_ENDPOINT=https://<resource>.openai.azure.com/openai/deployments/<deployment>/
    CHAT_API_KEY=<your-api-key>
    CHAT_MODEL_NAME=gpt-5.1
    
    # Data Directory (optional - defaults to 'data')
    DATA_DIR=data
  5. Download data files

    python scripts/download_data.py
  6. Run the application

    uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
  7. Open in browser

    http://localhost:8000
    

Configuration

Application Constants

Configure in app/core/config.py:

  • DEFAULT_SIMILAR_CASES = 6 - Number of similar cases shown by default
  • MAX_SIMILAR_CASES = 20 - Maximum similar cases allowed per request
  • MAX_EXPLANATION_TOKENS = 300 - Max tokens for AI explanations
  • MAX_REASONING_LENGTH = 2000 - Max length for student reasoning text
  • IMAGE_CACHE_MAX_AGE = 3600 - Browser cache duration for images (seconds)
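
As a hedged sketch of how app/core/config.py might look: the constant names and defaults come from the list above, while the os.environ loading pattern is an assumption about the implementation.

import os

# Tunable application constants (values as documented above).
DEFAULT_SIMILAR_CASES = 6      # similar cases shown by default
MAX_SIMILAR_CASES = 20         # upper bound per hint request
MAX_EXPLANATION_TOKENS = 300   # cap on AI explanation length
MAX_REASONING_LENGTH = 2000    # cap on student reasoning text
IMAGE_CACHE_MAX_AGE = 3600     # browser cache for images, in seconds

# Environment-driven settings (see the table below).
DATA_DIR = os.environ.get("DATA_DIR", "data")
AZURE_STORAGE_ACCOUNT_URL = os.environ["AZURE_STORAGE_ACCOUNT_URL"]
AZURE_STORAGE_CONTAINER = os.environ["AZURE_STORAGE_CONTAINER"]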

Environment Variables

Variable                    Required   Description
AZURE_STORAGE_ACCOUNT_URL   Yes        Azure Blob Storage account URL
AZURE_STORAGE_CONTAINER     Yes        Container name for CXR images
CHAT_ENDPOINT               Yes        Azure OpenAI endpoint URL
CHAT_API_KEY                Yes        Azure OpenAI API key
CHAT_MODEL_NAME             Yes        Chat model name (default: gpt-5.1)
DATA_DIR                    No         Data directory path (default: data)
MEDIMAGEINSIGHT_ENDPOINT    No         Only needed for building embeddings
MEDIMAGEINSIGHT_KEY         No         Only needed for building embeddings
CXRREPORTGEN_ENDPOINT       No         Only needed for generating reports
CXRREPORTGEN_KEY            No         Only needed for generating reports

API Endpoints

Health Check

  • GET / - Redirects to frontend
  • GET /api/health - Health check with dataset stats

Case Management

  • GET /api/case/random - Get a random CXR case
  • GET /api/case/{image_id}/hint?k=6 - Get similar cases (hint mode)
  • POST /api/case/{image_id}/submit - Submit diagnosis and get feedback

Image Proxy

  • GET /api/image/{image_id} - Proxy endpoint for CXR images
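
As a quick smoke test, a hedged example of exercising these endpoints with the requests library. The JSON field names (image_id, diagnosis, reasoning) are assumptions; check the Pydantic schemas at /docs for the actual shapes.

import requests

BASE = "http://localhost:8000"

# Pull a random case; the response schema is defined in app/models.py.
case = requests.get(f"{BASE}/api/case/random").json()
image_id = case["image_id"]  # assumed field name -- verify against /docs

# Fetch up to 6 similar cases as a hint.
hints = requests.get(f"{BASE}/api/case/{image_id}/hint", params={"k": 6}).json()

# Submit a diagnosis with free-text reasoning and get AI tutor feedback.
feedback = requests.post(
    f"{BASE}/api/case/{image_id}/submit",
    json={"diagnosis": "Cardiomegaly", "reasoning": "Enlarged cardiac silhouette."},
).json()
print(feedback)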

API Documentation

  • GET /docs - Interactive Swagger UI documentation
  • GET /redoc - ReDoc documentation

Testing

Run unit tests:

# Test services
python tests/dev/test_cases_service.py
python tests/dev/test_chat_service.py

# Test API endpoints (requires running server)
python tests/dev/test_api_endpoints.py

Data Pipeline & Scripts

The application uses pre-generated data files. Here's how the data processing works:

Available Scripts

1. scripts/build_index.py - Main Data Processing Pipeline

Purpose: Process raw images and generate all AI-powered artifacts

What it does:

  • Reads data/nih_demo_index.csv (index of selected images with metadata)
  • Downloads each image from Azure Blob Storage
  • Calls MedImageInsight API to generate 1024-dimensional embeddings
  • Calls CXRReportGen API to generate structured radiology reports with bounding boxes
  • Builds scikit-learn NearestNeighbors index for fast similarity search
  • Saves results periodically (every 50 images) to prevent data loss
  • Logs progress to logs/build_index_<timestamp>.log

Output files (saved to data/):

  • embeddings_complete.npy - NumPy array of image embeddings (~4,300 × 1024)
  • metadata_complete.parquet - Parquet file with image metadata and labels
  • cxr_reports_complete.json - JSON with structured radiology reports
  • nn_index_complete.pkl - Joblib-serialized nearest neighbor search index

Usage:

python scripts/build_index.py

Requirements:

  • MEDIMAGEINSIGHT_ENDPOINT and MEDIMAGEINSIGHT_KEY environment variables
  • CXRREPORTGEN_ENDPOINT and CXRREPORTGEN_KEY environment variables
  • data/nih_demo_index.csv input file
  • Images already uploaded to Azure Blob Storage

Processing time: ~2-4 hours for 4,300 images (depending on API response times)
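
The indexing step at the end of this pipeline is standard scikit-learn. A minimal sketch, assuming cosine distance over the MedImageInsight embeddings (the metric choice is an assumption, not confirmed by the script):

import joblib
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stack of per-image MedImageInsight vectors, one row per image.
embeddings = np.load("data/embeddings_complete.npy")

# Fit a brute-force cosine index; at ~4,300 vectors this is fast and exact.
nn_index = NearestNeighbors(n_neighbors=20, metric="cosine", algorithm="brute")
nn_index.fit(embeddings)

# Serialize with joblib, matching the nn_index_complete.pkl artifact above.
joblib.dump(nn_index, "data/nn_index_complete.pkl")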

2. scripts/download_data.py - Deployment Data Downloader

Purpose: Download pre-processed data artifacts during application startup

What it does:

  • Downloads the 4 output files from Azure Blob Storage
  • Skips files that already exist locally
  • Used by startup.sh during Azure App Service deployment
  • Much faster than regenerating (~2-5 minutes vs hours)

Usage:

python scripts/download_data.py
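
A hedged sketch of the skip-if-present download logic, using the azure-storage-blob SDK. The app-data/ blob prefix matches the upload command shown later; the credential flow (DefaultAzureCredential) is an assumption about the real script.

import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

ARTIFACTS = [
    "embeddings_complete.npy",
    "metadata_complete.parquet",
    "cxr_reports_complete.json",
    "nn_index_complete.pkl",
]

service = BlobServiceClient(
    account_url=os.environ["AZURE_STORAGE_ACCOUNT_URL"],
    credential=DefaultAzureCredential(),
)
container = service.get_container_client(os.environ["AZURE_STORAGE_CONTAINER"])

data_dir = os.environ.get("DATA_DIR", "data")
os.makedirs(data_dir, exist_ok=True)
for name in ARTIFACTS:
    local_path = os.path.join(data_dir, name)
    if os.path.exists(local_path):
        continue  # skip files that already exist locally
    with open(local_path, "wb") as f:
        f.write(container.download_blob(f"app-data/{name}").readall())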

3. scripts/data_utils.py - Data Access Utilities

Purpose: Provides a clean Python interface to load and query the processed data

What it provides:

  • CXRDataset class - unified interface to all data files
  • Validates data alignment (embeddings, metadata, reports all match)
  • Methods: get_metadata(), get_embedding(), get_report(), find_similar()
  • Used by app/services/cases.py in the main application
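
A hedged usage example built from the method names listed above; the constructor and exact signatures are assumptions, so consult scripts/data_utils.py for the real interface.

from scripts.data_utils import CXRDataset

# Load and cross-validate all four artifacts from the data/ directory
# (constructor argument is assumed to be the data directory path).
dataset = CXRDataset("data")

meta = dataset.get_metadata("00000001_000.png")     # labels and Azure path
vector = dataset.get_embedding("00000001_000.png")  # 1024-dim MedImageInsight vector
report = dataset.get_report("00000001_000.png")     # CXRReportGen findings
neighbors = dataset.find_similar("00000001_000.png", k=6)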

Rebuilding from Scratch

If you need to regenerate the data pipeline:

  1. Prepare Input CSV (data/nih_demo_index.csv):

    Image_ID,Image_Filename,Azure_Path,Binary_Label,Main_Labels,Multi_Class,Condition_Count,Source
    1,00000001_000.png,images/00000001_000.png,Abnormal,Cardiomegaly,Cardiomegaly,1,NIH
    2,00000002_001.png,images/00000002_001.png,Normal,,No Finding,0,NIH
    ...
    
  2. Upload Images to Azure Blob Storage:

    • Container: As specified in AZURE_STORAGE_CONTAINER
    • Path: Match the Azure_Path column in your CSV
  3. Configure API Keys:

    export MEDIMAGEINSIGHT_ENDPOINT="https://<resource>.cognitiveservices.azure.com/"
    export MEDIMAGEINSIGHT_KEY="<your-key>"
    export CXRREPORTGEN_ENDPOINT="https://<resource>.cognitiveservices.azure.com/"
    export CXRREPORTGEN_KEY="<your-key>"
  4. Run Pipeline:

    python scripts/build_index.py
  5. Upload Results to Blob Storage (for deployment):

    # Upload the 4 generated files to Azure Blob Storage under app-data/ prefix
    az storage blob upload --account-name <account> --container-name <container> \
      --name app-data/embeddings_complete.npy --file data/embeddings_complete.npy
    # ... repeat for other files
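
Equivalently, a small Python loop mirroring the download script can push all four artifacts at once (a sketch; the app-data/ prefix follows the az command above):

import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url=os.environ["AZURE_STORAGE_ACCOUNT_URL"],
    credential=DefaultAzureCredential(),
)
container = service.get_container_client(os.environ["AZURE_STORAGE_CONTAINER"])

for name in [
    "embeddings_complete.npy",
    "metadata_complete.parquet",
    "cxr_reports_complete.json",
    "nn_index_complete.pkl",
]:
    with open(os.path.join("data", name), "rb") as f:
        # overwrite=True so re-runs replace stale artifacts
        container.upload_blob(name=f"app-data/{name}", data=f, overwrite=True)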

What's NOT in the Scripts

The following one-time manual processes were performed, but their scripts are not included in this repository:

Image Selection Script:

  • Downloaded ~4,300 images from the Kaggle NIH ChestX-ray14 dataset
  • Implemented stratified sampling to ensure balanced condition representation (one possible approach is sketched below)
  • Uploaded selected images to Azure Blob Storage
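
For anyone reproducing the selection step, a hedged sketch of one way to stratify the sample with pandas. It assumes the Data_Entry_2017.csv label file that ships with the Kaggle dataset; the per-condition quota N is illustrative, not the value actually used.

import pandas as pd

# Label file distributed with the Kaggle NIH ChestX-ray14 dataset (assumed input).
labels = pd.read_csv("Data_Entry_2017.csv")

# "Finding Labels" holds pipe-separated conditions; explode to one row per label.
per_label = labels.assign(label=labels["Finding Labels"].str.split("|")).explode("label")

# Sample up to N images per condition, then de-duplicate, since a multi-label
# image is drawn once for each condition it carries.
N = 300  # illustrative quota per condition
sampled = (
    per_label.groupby("label", group_keys=False)
    .apply(lambda g: g.sample(min(N, len(g)), random_state=42))
)
selected = sampled["Image Index"].drop_duplicates()
selected.to_csv("selected_images.csv", index=False)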

Index CSV Creation:

  • Parsed original NIH dataset metadata
  • Matched image filenames to labels
  • Created nih_demo_index.csv with aligned metadata

Educational Use Only

Important Disclaimer

This application is designed for educational purposes only. It should not be used for:

  • Clinical diagnosis or treatment decisions
  • Patient care or medical advice
  • Any situation requiring FDA-approved medical devices
  • Processing actual protected health information (PHI)

The AI-generated explanations and reports are for teaching pattern recognition in radiology, not clinical decision support.

Acknowledgments

  • NIH ChestX-ray14 Dataset: Chest X-ray images from NIH Clinical Center
  • MedImageInsight: Medical image embedding model
  • CXRReportGen: Structured chest X-ray report generation
  • Azure OpenAI: GPT-5.1 for educational feedback
