# Medical Document Classifier

This notebook demonstrates how to build a medical document classifier using the Hugging Face `transformers` and `datasets` libraries. We'll classify medical transcriptions by medical specialty, using labels such as:
- Cardiovascular / Pulmonary
- Orthopedic
- Neurology
- Gastroenterology
- Ophthalmology

In [1]:
from transformers import pipeline
from datasets import load_dataset
import pandas as pd

In [2]:
# Load the medical cases classification dataset
dataset = load_dataset("hpe-ai/medical-cases-classification-tutorial")

# Display basic information about the dataset
print(f"Dataset keys: {list(dataset.keys())}")
print(f"Train dataset size: {len(dataset['train'])}")

# Show the structure of the dataset
print("\nDataset features:")
print(dataset['train'].features)

# Display first few examples
print("\nFirst 3 examples:")
for i in range(3):
    example = dataset['train'][i]
    print(f"\nExample {i+1}:")
    for key, value in example.items():
        if isinstance(value, str) and len(value) > 200:
            print(f"{key}: {value[:200]}...")
        else:
            print(f"{key}: {value}")

Dataset keys: ['train', 'validation', 'test']
Train dataset size: 1724

Dataset features:
{'description': Value('string'), 'transcription': Value('string'), 'sample_name': Value('string'), 'medical_specialty': Value('string'), 'keywords': Value('string')}

First 3 examples:

Example 1:
description: Pacemaker ICD interrogation.  Severe nonischemic cardiomyopathy with prior ventricular tachycardia.
transcription: PROCEDURE NOTE: , Pacemaker ICD interrogation.,HISTORY OF PRESENT ILLNESS: , The patient is a 67-year-old gentleman who was admitted to the hospital.  He has had ICD pacemaker implantation.  This is a...
sample_name: Pacemaker Interrogation
medical_specialty: Cardiovascular / Pulmonary
keywords: cardiovascular / pulmonary, cardiomyopathy, ventricular, tachycardia, pacemaker icd interrogation, millivolts, impendence, interrogation, pacemaker,

Example 2:
description: Erythema of the right knee and leg, possible septic knee. Aspiration through the anterolateral portal of knee join
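
The split sizes reported above (1724 / 370 / 370) can be turned into fractions to see how the data was partitioned. The following is a minimal sketch with the sizes hard-coded from the loading output, so it runs without downloading anything; `split_fractions` is a hypothetical helper, not part of the `datasets` library.

```python
def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# Sizes copied from the loading output above.
sizes = {"train": 1724, "validation": 370, "test": 370}
fractions = split_fractions(sizes)

for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

With the loaded dataset, the same dictionary could be built as `{name: len(split) for name, split in dataset.items()}`. The numbers above correspond to roughly a 70/15/15 split.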

In [3]:
# Since the dataset already has medical specialties, let's use that for classification
# We'll demonstrate with a subset and show how to use the actual medical specialty labels

# Let's work with a sample of the dataset for demonstration
sample_size = 10
train_sample = dataset['train'].select(range(sample_size))

# Extract the medical texts and their actual labels
medical_texts = []
actual_specialties = []
sample_names = []

for i in range(sample_size):
    description = train_sample[i]['description']
    transcription = train_sample[i]['transcription'][:300]  # First 300 chars of transcription
    combined_text = f"{description}. {transcription}"
    
    medical_texts.append(combined_text)
    actual_specialties.append(train_sample[i]['medical_specialty'])
    sample_names.append(train_sample[i]['sample_name'])

print(f"Selected {len(medical_texts)} medical texts for analysis")
print(f"Medical specialties found: {set(actual_specialties)}")
print(f"\nExample text: {medical_texts[0][:200]}...")
print(f"Actual specialty: {actual_specialties[0]}")
print(f"Sample name: {sample_names[0]}")

Selected 10 medical texts for analysis
Medical specialties found: {'Neurology', 'Orthopedic', 'Nephrology', 'ENT - Otolaryngology', 'Obstetrics / Gynecology', 'Cardiovascular / Pulmonary', 'Ophthalmology', 'Gastroenterology'}

Example text: Pacemaker ICD interrogation.  Severe nonischemic cardiomyopathy with prior ventricular tachycardia.. PROCEDURE NOTE: , Pacemaker ICD interrogation.,HISTORY OF PRESENT ILLNESS: , The patient is a 67-ye...
Actual specialty: Cardiovascular / Pulmonary
Sample name: Pacemaker Interrogation
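
The example text above contains a doubled period (`tachycardia.. PROCEDURE NOTE`) because the description already ends with one. A small helper can avoid that, and can also tolerate a missing field. This is only a sketch: `build_text` and its `max_chars` parameter are hypothetical names, and whether any row in this dataset actually has an empty field is an assumption, not something the output above shows.

```python
def build_text(description, transcription, max_chars=300):
    """Join a description with the start of a transcription.

    Tolerates missing (None) fields and avoids a doubled period when the
    description already ends with sentence punctuation.
    """
    description = (description or "").strip()
    snippet = (transcription or "")[:max_chars].strip()
    if not description:
        return snippet
    if not snippet:
        return description
    separator = " " if description.endswith((".", "!", "?")) else ". "
    return f"{description}{separator}{snippet}"


print(build_text("Pacemaker ICD interrogation.", "PROCEDURE NOTE: , Pacemaker ICD interrogation."))
print(build_text("Knee aspiration", None))
```

In the loop above, `combined_text = build_text(description, train_sample[i]['transcription'])` would be a drop-in replacement for the f-string.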


In [5]:
# Display the results in a structured format
print("Medical Document Classification Results")
print("=" * 60)

for idx in range(len(medical_texts)):
    keywords = train_sample[idx]['keywords']
    keywords_display = keywords[:100] + "..." if keywords and len(keywords) > 100 else (keywords or "No keywords available")
    
    print(f"\nDocument {idx + 1}: {sample_names[idx]}")
    print(f"Text: {medical_texts[idx][:150]}...")
    print(f"Actual Medical Specialty: {actual_specialties[idx]}")
    print(f"Keywords: {keywords_display}")
    print("-" * 40)

# Show distribution of medical specialties in our sample
from collections import Counter

specialty_counts = Counter(actual_specialties)

print("\nMedical Specialty Distribution in Sample:")
for specialty, count in specialty_counts.items():
    print(f"- {specialty}: {count} cases")

# Show overall dataset statistics
print("\nDataset Overview:")
print(f"- Total training examples: {len(dataset['train'])}")
print(f"- Validation examples: {len(dataset['validation'])}")
print(f"- Test examples: {len(dataset['test'])}")

# Count each specialty in the full training split, reading the column only once
specialty_totals = Counter(dataset['train']['medical_specialty'])
all_specialties = set(specialty_totals)
print(f"\nAll Medical Specialties in Dataset ({len(all_specialties)} total):")
for specialty in sorted(all_specialties):
    print(f"- {specialty}: {specialty_totals[specialty]} cases")

Medical Document Classification Results

Document 1: Pacemaker Interrogation
Text: Pacemaker ICD interrogation.  Severe nonischemic cardiomyopathy with prior ventricular tachycardia.. PROCEDURE NOTE: , Pacemaker ICD interrogation.,HI...
Actual Medical Specialty: Cardiovascular / Pulmonary
Keywords: cardiovascular / pulmonary, cardiomyopathy, ventricular, tachycardia, pacemaker icd interrogation, m...
----------------------------------------

Document 2: Aspiration - Knee Joint
Text: Erythema of the right knee and leg, possible septic knee. Aspiration through the anterolateral portal of knee joint.. PREOPERATIVE DIAGNOSES: , Erythe...
Actual Medical Specialty: Orthopedic
Keywords: orthopedic, knee and leg, anterolateral portal, emergency department, spinal needle, septic knee, kn...
----------------------------------------

Document 3: Cardiac Cath & Selective Coronary Angiography
Text: Left cardiac catheterization with selective right and left coronary angiography.   Post infarct angin
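
The specialty counts computed above also give a simple reference point for any classifier: the accuracy obtained by always predicting the most frequent label. The sketch below uses toy labels for illustration only (they are not the dataset's real distribution), and `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    if not labels:
        raise ValueError("labels must not be empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Toy labels for illustration only; not taken from the dataset.
toy_labels = ["Orthopedic", "Neurology", "Orthopedic", "Nephrology"]
label, accuracy = majority_baseline(toy_labels)
print(f"Always predicting {label!r} scores {accuracy:.2f}")
```

Passing `list(dataset['train']['medical_specialty'])` instead of `toy_labels` would give the baseline for the real training split.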

In [7]:
# Let's demonstrate zero-shot classification of medical documents.
# Note: the pipeline below uses facebook/bart-large-mnli, a general-purpose NLI model,
# not a model fine-tuned on medical text.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

try:
    # Load the Bio_ClinicalBERT tokenizer (a clinical-domain model).
    # Note that it is not used by the zero-shot pipeline below.
    model_name = "emilyalsentzer/Bio_ClinicalBERT"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Create a zero-shot classification pipeline for medical specialties
    classifier = pipeline(
        "zero-shot-classification",
        model="facebook/bart-large-mnli"
    )
    
    # Define the medical specialties from our dataset as candidate labels
    medical_specialties = [
        "Cardiovascular Pulmonary",
        "Orthopedic", 
        "Nephrology",
        "ENT Otolaryngology",
        "Obstetrics Gynecology",
        "Ophthalmology",
        "Gastroenterology",
        "Neurology",
        "Radiology",
        "Psychiatry Psychology",
        "Pediatrics Neonatal",
        "Hematology Oncology",
        "Neurosurgery"
    ]
    
    # Test classification on first 3 samples
    print("Zero-Shot Medical Document Classification Results")
    print("=" * 60)
    
    for i in range(3):
        text_to_classify = medical_texts[i]
        actual_specialty = actual_specialties[i]
        
        # Perform zero-shot classification
        result = classifier(text_to_classify, medical_specialties)
        
        print(f"\nDocument {i+1}: {sample_names[i]}")
        print(f"Text: {text_to_classify[:150]}...")
        print(f"Actual Specialty: {actual_specialty}")
        print(f"Predicted Specialty: {result['labels'][0]}")
        print(f"Confidence: {result['scores'][0]:.3f}")
        
        # Show top 3 predictions
        print("Top 3 predictions:")
        for j in range(min(3, len(result['labels']))):
            print(f"  {j+1}. {result['labels'][j]}: {result['scores'][j]:.3f}")
        print("-" * 50)
        
    print("\nNote: This is a zero-shot classification example.")
    print("For better accuracy, you would typically fine-tune a model")
    print("specifically on the medical cases classification dataset.")
    
except Exception as e:
    print(f"Note: Advanced classification model not available: {e}")
    print("The dataset is ready for use with any medical text classification model.")
    print("You can fine-tune models like Bio_ClinicalBERT, ClinicalBERT, or other medical domain models.")

Device set to use cuda:0


Zero-Shot Medical Document Classification Results

Document 1: Pacemaker Interrogation
Text: Pacemaker ICD interrogation.  Severe nonischemic cardiomyopathy with prior ventricular tachycardia.. PROCEDURE NOTE: , Pacemaker ICD interrogation.,HI...
Actual Specialty: Cardiovascular / Pulmonary
Predicted Specialty: Radiology
Confidence: 0.309
Top 3 predictions:
  1. Radiology: 0.309
  2. Cardiovascular Pulmonary: 0.284
  3. Nephrology: 0.086
--------------------------------------------------

Document 2: Aspiration - Knee Joint
Text: Erythema of the right
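
The candidate labels passed to the classifier drop the punctuation used in the dataset (`Cardiovascular Pulmonary` versus `Cardiovascular / Pulmonary`), so a direct string comparison would count even correct predictions as wrong. One way to score predictions is to normalize both sides first. This is a sketch, not the notebook's own evaluation: `normalize_label` and `accuracy` are hypothetical helpers, and the second label pair below is made up for illustration.

```python
import re


def normalize_label(label):
    """Lowercase a label and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def accuracy(actual, predicted):
    """Fraction of positions where the normalized labels agree."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("no labels to score")
    hits = sum(
        normalize_label(a) == normalize_label(p)
        for a, p in zip(actual, predicted)
    )
    return hits / len(actual)


# First pair comes from the output above; the second pair is hypothetical.
actual = ["Cardiovascular / Pulmonary", "ENT - Otolaryngology"]
predicted = ["Radiology", "ENT Otolaryngology"]
print(accuracy(actual, predicted))
```

In the classification loop, `result['labels'][0]` would supply each entry of `predicted` and `actual_specialties[i]` each entry of `actual`.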