An interactive chest X-ray (CXR) teaching application for medical imaging education. This web application provides medical students and healthcare professionals with an AI-powered platform to practice interpreting chest X-rays with immediate feedback.
- Random Case Selection: Practice with ~4,300 real chest X-ray images from the NIH ChestX-ray14 dataset
- Similar Case Hints: View similar cases before submitting to aid learning
- AI Tutor Feedback: Get educational explanations powered by Azure OpenAI (GPT-5.1)
- Structured Radiology Reports: View CXRReportGen findings with bounding box visualizations
- Similarity Search: Cases indexed using MedImageInsight embeddings for intelligent recommendations
- Educational Focus: Designed for learning, not clinical use
The chest X-ray images are sourced from the NIH ChestX-ray14 dataset available on Kaggle:
- Original Dataset: NIH Chest X-rays on Kaggle
- Total Original Dataset: 112,120 frontal-view X-ray images from 30,805 unique patients
For this educational application, we created a curated subset:
- Selected Images: ~4,300 images randomly sampled from the full dataset
- Selection Strategy: Stratified sampling to ensure adequate representation of each medical condition
- Conditions Covered: All 14 pathology labels from the NIH dataset:
- Atelectasis, Cardiomegaly, Consolidation, Edema, Effusion, Emphysema
- Fibrosis, Hernia, Infiltration, Mass, Nodule, Pleural Thickening
- Pneumonia, Pneumothorax
- Normal Cases: Included to provide balanced learning scenarios
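The selection script is not included in this repository (see the one-time manual processes near the end of this README), but you can inspect label balance in the source metadata yourself. This sketch assumes the Kaggle metadata file `Data_Entry_2017.csv` with its pipe-separated `Finding Labels` column:

```python
# Quick sanity check of label balance in the NIH metadata.
# Assumes the Kaggle file Data_Entry_2017.csv (not part of this repository).
import pandas as pd

df = pd.read_csv("Data_Entry_2017.csv")
# One row per (image, label) pair, so multi-label images count once per condition
labels = df["Finding Labels"].str.split("|").explode()
print(labels.value_counts())
```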
1. Image Selection & Upload (one-time manual process):
   - Downloaded ~4,300 images from the Kaggle dataset
   - Ensured balanced representation across all 14 conditions
   - Uploaded images to Azure Blob Storage for centralized access
   - Created index CSV (`nih_demo_index.csv`) with metadata and labels
2. AI Model Processing (`scripts/build_index.py`):
   - Generated MedImageInsight embeddings (1024-dimensional vectors) for each image
   - Generated CXRReportGen structured reports with findings and bounding boxes
   - Built a scikit-learn NearestNeighbors index for similarity search
   - Saved all artifacts to the `data/` directory for application use
3. Deployment (`scripts/download_data.py`):
   - Pre-processed data artifacts are stored in Azure Blob Storage
   - Downloaded automatically during application startup
   - No runtime AI model calls needed (everything is pre-computed)
- Backend: FastAPI (Python 3.10+)
- Frontend: Vanilla JavaScript, HTML5, CSS3
- AI Services: Azure OpenAI, MedImageInsight, CXRReportGen
- Data Storage: Azure Blob Storage
- Hosting: Azure App Service (Linux)
```
cxr-training-tool/
├── app/
│ ├── main.py # FastAPI application & API endpoints
│ ├── models.py # Pydantic request/response models
│ ├── core/
│ │ └── config.py # Configuration & constants
│ └── services/
│ ├── cases.py # Case data access layer
│ └── chat.py # AI tutor integration
├── scripts/
│ ├── download_data.py # Download data from Azure Blob Storage
│ ├── data_utils.py # Dataset wrapper utilities
│ └── build_index.py # Build embeddings & NN index
├── static/
│ ├── index.html # Frontend UI
│ └── app.js # Frontend logic
├── data/ # Data files (not in git)
│ ├── embeddings_complete.npy
│ ├── metadata_complete.parquet
│ ├── cxr_reports_complete.json
│ └── nn_index_complete.pkl
├── startup.sh # Azure App Service startup script
├── requirements.txt # Python dependencies
└── README.md
```
- Python 3.10 or higher
- Azure account with access to:
- Azure Blob Storage (for CXR images and data)
- Azure OpenAI (for GPT-5.1)
- MedImageInsight endpoint
- CXRReportGen endpoint
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd cxr-training-tool
   ```

2. Create virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables

   Create a `.env` file in the project root:

   ```bash
   # Azure Storage
   AZURE_STORAGE_ACCOUNT_URL=https://<account>.blob.core.windows.net
   AZURE_STORAGE_CONTAINER=nih-cxr

   # MedImageInsight (optional - for building embeddings)
   MEDIMAGEINSIGHT_ENDPOINT=https://<endpoint>.cognitiveservices.azure.com/
   MEDIMAGEINSIGHT_KEY=<your-key>

   # CXRReportGen (optional - for generating reports)
   CXRREPORTGEN_ENDPOINT=https://<endpoint>.cognitiveservices.azure.com/
   CXRREPORTGEN_KEY=<your-key>

   # Chat Model (Azure OpenAI)
   CHAT_ENDPOINT=https://<resource>.openai.azure.com/openai/deployments/<deployment>/
   CHAT_API_KEY=<your-api-key>
   CHAT_MODEL_NAME=gpt-5.1

   # Data Directory (optional - defaults to 'data')
   DATA_DIR=data
   ```

5. Download data files

   ```bash
   python scripts/download_data.py
   ```

6. Run the application

   ```bash
   uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
   ```

7. Open in browser

   http://localhost:8000
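To verify the server is up, you can hit the health endpoint (documented under the API endpoints below) from Python:

```python
# Minimal smoke test against a locally running instance (standard library only).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/api/health") as resp:
    print(json.loads(resp.read()))
```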
Configure in `app/core/config.py`:

- `DEFAULT_SIMILAR_CASES = 6` - Number of similar cases shown by default
- `MAX_SIMILAR_CASES = 20` - Maximum similar cases allowed per request
- `MAX_EXPLANATION_TOKENS = 300` - Max tokens for AI explanations
- `MAX_REASONING_LENGTH = 2000` - Max length for student reasoning text
- `IMAGE_CACHE_MAX_AGE = 3600` - Browser cache duration for images (seconds)
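For orientation, here is a hypothetical sketch of how `app/core/config.py` can combine these constants with the environment variables listed below; the real module may differ:

```python
# Hypothetical sketch of app/core/config.py -- the actual module may differ.
import os

# Tunable constants (see descriptions above)
DEFAULT_SIMILAR_CASES = 6
MAX_SIMILAR_CASES = 20
MAX_EXPLANATION_TOKENS = 300
MAX_REASONING_LENGTH = 2000
IMAGE_CACHE_MAX_AGE = 3600  # seconds

# Environment-driven settings (see the table below)
AZURE_STORAGE_ACCOUNT_URL = os.environ["AZURE_STORAGE_ACCOUNT_URL"]
AZURE_STORAGE_CONTAINER = os.environ["AZURE_STORAGE_CONTAINER"]
DATA_DIR = os.getenv("DATA_DIR", "data")
CHAT_MODEL_NAME = os.getenv("CHAT_MODEL_NAME", "gpt-5.1")
```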
| Variable | Required | Description |
|---|---|---|
| `AZURE_STORAGE_ACCOUNT_URL` | Yes | Azure Blob Storage account URL |
| `AZURE_STORAGE_CONTAINER` | Yes | Container name for CXR images |
| `CHAT_ENDPOINT` | Yes | Azure OpenAI endpoint URL |
| `CHAT_API_KEY` | Yes | Azure OpenAI API key |
| `CHAT_MODEL_NAME` | Yes | Model name (default: `gpt-5.1`) |
| `DATA_DIR` | No | Data directory path (default: `data`) |
| `MEDIMAGEINSIGHT_ENDPOINT` | No | Only needed for building embeddings |
| `MEDIMAGEINSIGHT_KEY` | No | Only needed for building embeddings |
| `CXRREPORTGEN_ENDPOINT` | No | Only needed for generating reports |
| `CXRREPORTGEN_KEY` | No | Only needed for generating reports |
- `GET /` - Redirects to the frontend
- `GET /api/health` - Health check with dataset stats
- `GET /api/case/random` - Get a random CXR case
- `GET /api/case/{image_id}/hint?k=6` - Get similar cases (hint mode)
- `POST /api/case/{image_id}/submit` - Submit a diagnosis and get feedback
- `GET /api/image/{image_id}` - Proxy endpoint for CXR images
- `GET /docs` - Interactive Swagger UI documentation
- `GET /redoc` - ReDoc documentation
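A typical practice loop over these endpoints looks like the sketch below. The submit payload field names are assumptions, so check `/docs` for the authoritative Pydantic schemas:

```python
# Walk the practice loop: random case -> hints -> submit.
# Field names ("image_id", "diagnosis", "reasoning") are assumptions.
# Requires: pip install requests
import requests

BASE = "http://localhost:8000"

case = requests.get(f"{BASE}/api/case/random").json()
image_id = case["image_id"]  # assumed field name; see /docs

hints = requests.get(f"{BASE}/api/case/{image_id}/hint", params={"k": 6}).json()

feedback = requests.post(
    f"{BASE}/api/case/{image_id}/submit",
    json={"diagnosis": "Cardiomegaly", "reasoning": "Enlarged cardiac silhouette."},
).json()
print(feedback)
```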
Run unit tests:
```bash
# Test services
python tests/dev/test_cases_service.py
python tests/dev/test_chat_service.py

# Test API endpoints (requires a running server)
python tests/dev/test_api_endpoints.py
```

The application uses pre-generated data files. Here's how the data processing works:
Purpose (`scripts/build_index.py`): Process raw images and generate all AI-powered artifacts
What it does:
- Reads `data/nih_demo_index.csv` (index of selected images with metadata)
- Downloads each image from Azure Blob Storage
- Calls MedImageInsight API to generate 1024-dimensional embeddings
- Calls CXRReportGen API to generate structured radiology reports with bounding boxes
- Builds scikit-learn NearestNeighbors index for fast similarity search
- Saves results periodically (every 50 images) to prevent data loss
- Logs progress to `logs/build_index_<timestamp>.log`
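For orientation, the index-building step boils down to a few lines of scikit-learn. This is a minimal sketch assuming a cosine metric; the actual metric is whatever `scripts/build_index.py` configures:

```python
# Sketch of the index-building step, given embeddings already generated.
import joblib
import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.load("data/embeddings_complete.npy")  # shape (n_images, 1024)

# k = MAX_SIMILAR_CASES + 1, since the query image is its own nearest neighbor
nn = NearestNeighbors(n_neighbors=21, metric="cosine")
nn.fit(embeddings)

joblib.dump(nn, "data/nn_index_complete.pkl")
```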
Output files (saved to `data/`):

- `embeddings_complete.npy` - NumPy array of image embeddings (4,300 × 1024)
- `metadata_complete.parquet` - Parquet file with image metadata and labels
- `cxr_reports_complete.json` - JSON with structured radiology reports
- `nn_index_complete.pkl` - Joblib-serialized nearest-neighbor search index
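All four artifacts can be loaded with standard tooling; `scripts/data_utils.py` wraps exactly this kind of logic:

```python
# Load the four artifacts directly (scripts/data_utils.py wraps this).
import json

import joblib
import numpy as np
import pandas as pd

embeddings = np.load("data/embeddings_complete.npy")          # (n_images, 1024) floats
metadata = pd.read_parquet("data/metadata_complete.parquet")  # image metadata and labels
with open("data/cxr_reports_complete.json") as f:
    reports = json.load(f)                                    # structured reports
nn_index = joblib.load("data/nn_index_complete.pkl")          # fitted NearestNeighbors

assert len(metadata) == embeddings.shape[0]  # alignment check, as data_utils validates
```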
Usage:
```bash
python scripts/build_index.py
```

Requirements:

- `MEDIMAGEINSIGHT_ENDPOINT` and `MEDIMAGEINSIGHT_KEY` environment variables
- `CXRREPORTGEN_ENDPOINT` and `CXRREPORTGEN_KEY` environment variables
- `data/nih_demo_index.csv` input file
- Images already uploaded to Azure Blob Storage
Processing time: ~2-4 hours for 4,300 images (depending on API response times)
Purpose (`scripts/download_data.py`): Download pre-processed data artifacts during application startup
What it does:
- Downloads the 4 output files from Azure Blob Storage
- Skips files that already exist locally
- Used by `startup.sh` during Azure App Service deployment
- Much faster than regenerating (~2-5 minutes vs. hours)
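A minimal sketch of that skip-if-exists logic, assuming `DefaultAzureCredential` authentication and the `app-data/` blob prefix used elsewhere in this README (`scripts/download_data.py` is the authoritative version):

```python
# Sketch of the skip-if-exists download logic.
# Requires: pip install azure-storage-blob azure-identity
import os
from pathlib import Path

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

FILES = [
    "embeddings_complete.npy",
    "metadata_complete.parquet",
    "cxr_reports_complete.json",
    "nn_index_complete.pkl",
]

service = BlobServiceClient(
    account_url=os.environ["AZURE_STORAGE_ACCOUNT_URL"],
    credential=DefaultAzureCredential(),
)
container = os.environ["AZURE_STORAGE_CONTAINER"]

data_dir = Path(os.getenv("DATA_DIR", "data"))
data_dir.mkdir(exist_ok=True)
for name in FILES:
    target = data_dir / name
    if target.exists():  # skip files that already exist locally
        continue
    blob = service.get_blob_client(container=container, blob=f"app-data/{name}")
    target.write_bytes(blob.download_blob().readall())
```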
Usage:
```bash
python scripts/download_data.py
```

Purpose (`scripts/data_utils.py`): Provides a clean Python interface to load and query the processed data
What it provides:
- `CXRDataset` class - unified interface to all data files
- Validates data alignment (embeddings, metadata, and reports all match)
- Methods: `get_metadata()`, `get_embedding()`, `get_report()`, `find_similar()`
- Used by `app/services/cases.py` in the main application
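Hypothetical usage (the constructor signature and the image identifier format are assumptions; see `scripts/data_utils.py` for the real interface):

```python
# Hypothetical usage of the CXRDataset wrapper -- signatures are assumptions.
from scripts.data_utils import CXRDataset

ds = CXRDataset(data_dir="data")            # loads and validates all four artifacts
meta = ds.get_metadata("00000001_000.png")  # identifier format assumed from the index CSV
report = ds.get_report("00000001_000.png")
similar = ds.find_similar("00000001_000.png", k=6)  # nearest neighbors by embedding
print(meta, report, similar)
```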
If you need to regenerate the data pipeline:
1. Prepare Input CSV (`data/nih_demo_index.csv`):

   ```csv
   Image_ID,Image_Filename,Azure_Path,Binary_Label,Main_Labels,Multi_Class,Condition_Count,Source
   1,00000001_000.png,images/00000001_000.png,Abnormal,Cardiomegaly,Cardiomegaly,1,NIH
   2,00000002_001.png,images/00000002_001.png,Normal,,No Finding,0,NIH
   ...
   ```

2. Upload Images to Azure Blob Storage:

   - Container: As specified in `AZURE_STORAGE_CONTAINER`
   - Path: Match the `Azure_Path` column in your CSV

3. Configure API Keys:

   ```bash
   export MEDIMAGEINSIGHT_ENDPOINT="https://<resource>.cognitiveservices.azure.com/"
   export MEDIMAGEINSIGHT_KEY="<your-key>"
   export CXRREPORTGEN_ENDPOINT="https://<resource>.cognitiveservices.azure.com/"
   export CXRREPORTGEN_KEY="<your-key>"
   ```

4. Run Pipeline:

   ```bash
   python scripts/build_index.py
   ```

5. Upload Results to Blob Storage (for deployment):

   ```bash
   # Upload the 4 generated files to Azure Blob Storage under the app-data/ prefix
   az storage blob upload --account-name <account> --container-name <container> \
     --name app-data/embeddings_complete.npy --file data/embeddings_complete.npy
   # ... repeat for the other files
   ```
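If you prefer Python over the CLI, here is a minimal equivalent of the upload step; authentication via `DefaultAzureCredential` is an assumption:

```python
# Python alternative to the az CLI commands above; same app-data/ prefix.
# Requires: pip install azure-storage-blob azure-identity
import os
from pathlib import Path

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url=os.environ["AZURE_STORAGE_ACCOUNT_URL"],
    credential=DefaultAzureCredential(),
)
container = os.environ["AZURE_STORAGE_CONTAINER"]

# Matches the four *_complete.* artifacts produced by build_index.py
for path in Path("data").glob("*_complete.*"):
    blob = service.get_blob_client(container=container, blob=f"app-data/{path.name}")
    blob.upload_blob(path.read_bytes(), overwrite=True)
    print(f"uploaded {path.name}")
```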
The following one-time manual processes were performed, but their scripts are not included:
Image Selection Script:
- Downloaded 4,300+ images from the Kaggle NIH ChestX-ray14 dataset
- Implemented stratified sampling to ensure balanced condition representation
- Uploaded selected images to Azure Blob Storage
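A hedged reconstruction of what that sampling step may have looked like; the per-condition quota is illustrative, and the Kaggle `Data_Entry_2017.csv` metadata file is an assumption:

```python
# Hedged reconstruction of the (not included) selection script: sample a
# fixed number of images per condition from the NIH metadata, then dedupe.
import pandas as pd

df = pd.read_csv("Data_Entry_2017.csv")
labels = df.assign(label=df["Finding Labels"].str.split("|")).explode("label")

PER_CONDITION = 250  # illustrative quota, not the project's actual numbers
sampled = (
    labels.groupby("label", group_keys=False)
    .apply(lambda g: g.sample(min(len(g), PER_CONDITION), random_state=0))
)
selected = sampled["Image Index"].drop_duplicates()  # multi-label images appear once
print(f"{len(selected)} images selected")
```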
Index CSV Creation:
- Parsed original NIH dataset metadata
- Matched image filenames to labels
- Created `nih_demo_index.csv` with aligned metadata
Important Disclaimer
This application is designed for educational purposes only. It should not be used for:
- Clinical diagnosis or treatment decisions
- Patient care or medical advice
- Any situation requiring FDA-approved medical devices
- Processing actual protected health information (PHI)
The AI-generated explanations and reports are for teaching pattern recognition in radiology, not clinical decision support.
- NIH ChestX-ray14 Dataset: Chest X-ray images from NIH Clinical Center
- MedImageInsight: Medical image embedding model
- CXRReportGen: Structured chest X-ray report generation
- Azure OpenAI: GPT-5.1 for educational feedback