## Setup

First, let's import the required libraries and set up the environment.

In [None]:
# Standard library imports
import sys
from pathlib import Path

# Add src to path for imports
sys.path.insert(0, str(Path.cwd().parent / 'src'))

# Scientific computing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', 15)
%matplotlib inline

print("✓ Libraries imported successfully!")
print(f"  NumPy version: {np.__version__}")
print(f"  Pandas version: {pd.__version__}")

---
## 1. 📁 DICOM File Loading

DICOM (Digital Imaging and Communications in Medicine) is the standard format for medical images like X-rays, CT scans, and MRIs.

In [None]:
# Import our DICOM loader
from ingestion import DICOMLoader, DICOMValidator, MetadataExtractor

# Initialize the loader
loader = DICOMLoader(
    source_type="local",
    batch_size=100,
    supported_modalities=["CT", "MR", "CR", "DX"]  # Common imaging types
)

print("DICOMLoader Configuration:")
print(f"  Source type: {loader.source_type}")
print(f"  Batch size: {loader.batch_size}")
print(f"  Supported modalities: {loader.supported_modalities}")
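
The `batch_size=100` setting suggests that files are processed in fixed-size groups. How `DICOMLoader` batches internally is not shown in this notebook, so the following is only a minimal sketch of the general idea; `iter_batches` and the file names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists holding at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # iterator exhausted
            return
        yield batch


# Hypothetical file names, used only to show the batch sizes
demo_paths = [f"scan_{i:03d}.dcm" for i in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(demo_paths, batch_size=100)]
print(batch_sizes)  # [100, 100, 50]
```

Working on one batch at a time keeps memory use bounded when a directory holds thousands of images.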

In [None]:
# Check for sample DICOM files
data_dir = Path.cwd().parent / "data" / "dicom"

if data_dir.exists():
    dcm_files = list(data_dir.rglob("*.dcm"))
    print(f"Found {len(dcm_files)} DICOM files in {data_dir}")
    
    if dcm_files:
        # Load files
        results = loader.load_directory(data_dir)
        print(f"\nLoaded {len(results)} files successfully")
        
        # Show statistics
        stats = loader.get_statistics()
        print(f"\nStatistics: {stats}")
else:
    print(f"No data directory found at {data_dir}")
    print("\nTo test with real DICOM files:")
    print("1. Download sample files from pydicom or NIH datasets")
    print("2. Place them in data/dicom/")
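
A quick way to check whether a file is plausibly DICOM before loading it is to look for the file signature: DICOM Part 10 files start with a 128-byte preamble followed by the four bytes `DICM`. The sketch below uses only the standard library; `has_dicom_magic` is a hypothetical helper and is not necessarily what `DICOMValidator` does. Some legacy files omit the preamble, so a failed check is a hint rather than proof.

```python
import tempfile
from pathlib import Path


def has_dicom_magic(path: Path) -> bool:
    """Return True if bytes 128..131 of the file spell b'DICM'."""
    with open(path, "rb") as handle:
        header = handle.read(132)
    return len(header) == 132 and header[128:132] == b"DICM"


# Build two tiny throwaway files to exercise the check
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.dcm"
    good.write_bytes(b"\x00" * 128 + b"DICM" + b"payload")
    bad = Path(tmp) / "bad.dcm"
    bad.write_bytes(b"not a dicom file")
    good_result = has_dicom_magic(good)
    bad_result = has_dicom_magic(bad)

print(good_result, bad_result)  # True False
```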

### Understanding DICOM Metadata

DICOM files contain rich metadata about the patient, study, and image:

In [None]:
# Key DICOM metadata fields
metadata_info = {
    "Patient Information": [
        "PatientID - Unique patient identifier",
        "PatientName - Patient's name (PHI - must be anonymized!)",
        "PatientBirthDate - Date of birth (PHI)",
        "PatientSex - M/F/O"
    ],
    "Study Information": [
        "StudyDate - Date of the imaging study",
        "StudyDescription - Description of the study",
        "Modality - CT, MR, CR (Computed Radiography), DX (Digital Radiography)",
        "BodyPartExamined - Chest, Head, etc."
    ],
    "Image Information": [
        "Rows - Image height in pixels",
        "Columns - Image width in pixels",
        "PixelSpacing - Physical size of pixels (mm)",
        "WindowCenter/WindowWidth - Display window settings"
    ]
}

for category, fields in metadata_info.items():
    print(f"\n{category}:")
    for field in fields:
        print(f"  • {field}")
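
Each of these keywords corresponds to a numeric tag written as `(group,element)` in hexadecimal. The sketch below lists the standard tags for the fields above and shows that the patient fields all live in group `0010`. The names `DICOM_TAGS` and `format_tag` are illustrative and not part of the `ingestion` package.

```python
# Standard DICOM tags for the keywords listed above: keyword -> (group, element)
DICOM_TAGS = {
    "PatientName": (0x0010, 0x0010),
    "PatientID": (0x0010, 0x0020),
    "PatientBirthDate": (0x0010, 0x0030),
    "PatientSex": (0x0010, 0x0040),
    "StudyDate": (0x0008, 0x0020),
    "Modality": (0x0008, 0x0060),
    "StudyDescription": (0x0008, 0x1030),
    "BodyPartExamined": (0x0018, 0x0015),
    "Rows": (0x0028, 0x0010),
    "Columns": (0x0028, 0x0011),
    "PixelSpacing": (0x0028, 0x0030),
    "WindowCenter": (0x0028, 0x1050),
    "WindowWidth": (0x0028, 0x1051),
}


def format_tag(keyword: str) -> str:
    """Render a keyword's tag in the usual (GGGG,EEEE) notation."""
    group, element = DICOM_TAGS[keyword]
    return f"({group:04X},{element:04X})"


# Group 0x0010 holds patient-level attributes
patient_fields = sorted(k for k, (group, _) in DICOM_TAGS.items() if group == 0x0010)

print(format_tag("Modality"))
print(patient_fields)
```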

---
## 2. 🔒 Data Anonymization

Medical data contains **Protected Health Information (PHI)** that must be removed or de-identified before the data is used for research. In the United States, this requirement comes from the HIPAA Privacy Rule.

In [None]:
from ingestion import Anonymizer

# Initialize anonymizer with strict settings
anonymizer = Anonymizer(
    anonymization_level="strict",  # Remove all PHI
    date_shift_days=30  # Shift dates to preserve temporal relationships
)

print("Anonymizer Configuration:")
print(f"  Level: {anonymizer.anonymization_level}")
print(f"  Date shift: {anonymizer.date_shift_days} days")
print(f"  PHI tags tracked: {len(anonymizer.PHI_TAGS)}")
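
At its simplest, removing PHI means dropping the identifying fields from each record while leaving the technical ones in place. The contents of `anonymizer.PHI_TAGS` are not shown in this notebook, so the keyword set below is an assumed, deliberately short example, and `strip_phi` is a hypothetical helper operating on a plain dictionary.

```python
from typing import Any, Dict, FrozenSet

# Assumed example set; a real list of PHI fields is considerably longer
PHI_KEYWORDS: FrozenSet[str] = frozenset({"PatientName", "PatientBirthDate"})


def strip_phi(metadata: Dict[str, Any],
              phi_keywords: FrozenSet[str] = PHI_KEYWORDS) -> Dict[str, Any]:
    """Return a copy of `metadata` without the PHI fields."""
    return {key: value for key, value in metadata.items() if key not in phi_keywords}


# Synthetic record, for illustration only
record = {
    "PatientName": "DOE^JOHN",
    "PatientBirthDate": "19700101",
    "Modality": "CT",
    "Rows": 512,
}
clean_record = strip_phi(record)
print(clean_record)  # {'Modality': 'CT', 'Rows': 512}
```

The original dictionary is left untouched, which makes it easy to compare records before and after anonymization.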

In [None]:
# Demonstrate patient ID hashing
print("Patient ID Anonymization (using secure hashing):")
print("-" * 50)

test_patient_ids = ["JOHN_DOE_123", "JANE_SMITH_456", "JOHN_DOE_123"]  # Note duplicate

for original_id in test_patient_ids:
    anonymized_id = anonymizer.hash_patient_id(original_id)
    print(f"  {original_id:20} → {anonymized_id}")

print("\n💡 Notice: Same patient ID always produces the same hash!")
print("   This allows linking records while protecting identity.")
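
One common way to get a stable but non-reversible identifier is a keyed hash. The sketch below uses HMAC-SHA256 from the standard library; the secret key, the `ANON_` prefix, the 12-character length, and the function name `pseudonymize_id` are all assumptions for illustration, and the actual `hash_patient_id` implementation may differ.

```python
import hashlib
import hmac


def pseudonymize_id(patient_id: str, secret_key: bytes, length: int = 12) -> str:
    """Derive a deterministic pseudonym from a patient ID using HMAC-SHA256."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ANON_" + digest[:length].upper()


# Hypothetical key; in practice it would be kept secret and stored outside the code
demo_key = b"replace-with-a-private-key"

first = pseudonymize_id("JOHN_DOE_123", demo_key)
second = pseudonymize_id("JANE_SMITH_456", demo_key)
repeat = pseudonymize_id("JOHN_DOE_123", demo_key)

print(first == repeat)   # same input, same pseudonym
print(first == second)   # different inputs, different pseudonyms
```

Using a secret key rather than a bare hash matters because patient IDs are often short and guessable; without the key, someone could hash candidate IDs and match them against the output.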

In [None]:
# Demonstrate date shifting
print("Date Shifting (preserves temporal relationships):")
print("-" * 50)

test_dates = ["2024-01-15", "2024-01-20", "2024-02-01"]

print(f"  Shift: +{anonymizer.date_shift_days} days\n")
for date in test_dates:
    shifted = anonymizer._shift_date_string(date)
    print(f"  {date} → {shifted}")
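
The reason date shifting preserves temporal relationships is that every date moves by the same offset, so the gaps between them are unchanged. Here is a minimal standard-library sketch of that idea, assuming ISO `YYYY-MM-DD` strings; `shift_iso_date` is a hypothetical stand-in for the anonymizer's private method.

```python
from datetime import date, timedelta


def shift_iso_date(date_str: str, shift_days: int) -> str:
    """Shift an ISO-formatted date string by a fixed number of days."""
    shifted = date.fromisoformat(date_str) + timedelta(days=shift_days)
    return shifted.isoformat()


original_dates = ["2024-01-15", "2024-01-20", "2024-02-01"]
shifted_dates = [shift_iso_date(d, 30) for d in original_dates]
print(shifted_dates)  # ['2024-02-14', '2024-02-19', '2024-03-02']


def gaps_in_days(dates):
    """Number of days between consecutive dates."""
    parsed = [date.fromisoformat(d) for d in dates]
    return [(later - earlier).days for earlier, later in zip(parsed, parsed[1:])]


print(gaps_in_days(original_dates), gaps_in_days(shifted_dates))  # [5, 12] [5, 12]
```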

---
## 3. 🖼️ Image Preprocessing

Medical images need preprocessing before analysis:
- **Windowing** - Adjusting contrast for visualization
- **Resizing** - Standardizing dimensions
- **Normalization** - Scaling pixel values

In [None]:
from preprocessing import ImagePreprocessor

# Initialize preprocessor
preprocessor = ImagePreprocessor(
    target_size=(224, 224),  # Standard size for neural networks
    normalize_method='zero_one',  # Scale to [0, 1]
    augmentation=False,
    random_seed=42
)

print("ImagePreprocessor Configuration:")
print(f"  Target size: {preprocessor.target_size}")
print(f"  Normalization: {preprocessor.normalize_method}")
print(f"  Augmentation: {preprocessor.augmentation}")

In [None]:
# Create a synthetic image to demonstrate preprocessing
# (In real use, this would be loaded from a DICOM file)

# Simulate a chest X-ray like image (512x512, 12-bit values)
np.random.seed(42)
synthetic_image = np.random.normal(1500, 500, (512, 512)).astype(np.float32)

# Add some structure (simulate lung regions)
y, x = np.ogrid[:512, :512]
center_mask = ((x - 256)**2 + (y - 256)**2) < 200**2
synthetic_image[center_mask] -= 800  # Lungs appear darker

print("Synthetic Image Created:")
print(f"  Shape: {synthetic_image.shape}")
print(f"  Value range: [{synthetic_image.min():.0f}, {synthetic_image.max():.0f}]")
print(f"  Dtype: {synthetic_image.dtype}")

In [None]:
# Apply preprocessing steps
print("Preprocessing Pipeline:")
print("=" * 50)

# Step 1: Windowing
windowed = preprocessor.apply_windowing(synthetic_image)
print("\n1. Windowing:")
print(f"   Input range: [{synthetic_image.min():.0f}, {synthetic_image.max():.0f}]")
print(f"   Output range: [{windowed.min()}, {windowed.max()}]")

# Step 2: Resize
resized = preprocessor.resize_image(windowed)
print("\n2. Resizing:")
print(f"   Input shape: {windowed.shape}")
print(f"   Output shape: {resized.shape}")

# Step 3: Normalize
normalized = preprocessor.normalize_image(resized)
print(f"\n3. Normalization ({preprocessor.normalize_method}):")
print(f"   Output range: [{normalized.min():.4f}, {normalized.max():.4f}]")
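
To make the windowing and normalization steps concrete, here is a minimal NumPy sketch. It assumes the usual center/width definition of a window and a plain min-max rescale; the defaults used by `apply_windowing` and `normalize_image` are not shown in this notebook, so the function names and the center/width values below are illustrative only.

```python
import numpy as np


def apply_window(image: np.ndarray, center: float, width: float) -> np.ndarray:
    """Clip to [center - width/2, center + width/2] and rescale to 0..255."""
    if width <= 0:
        raise ValueError("width must be positive")
    lower = center - width / 2.0
    upper = center + width / 2.0
    clipped = np.clip(image.astype(np.float64), lower, upper)
    scaled = (clipped - lower) / (upper - lower) * 255.0
    return scaled.astype(np.uint8)  # truncates fractional values


def normalize_zero_one(image: np.ndarray) -> np.ndarray:
    """Min-max scale to [0, 1]; a constant image maps to all zeros."""
    image = image.astype(np.float32)
    value_range = image.max() - image.min()
    if value_range == 0:
        return np.zeros_like(image)
    return (image - image.min()) / value_range


pixels = np.array([[0, 1000, 1500], [2000, 4000, 1250]], dtype=np.float32)
windowed_demo = apply_window(pixels, center=1500, width=1000)
normalized_demo = normalize_zero_one(windowed_demo)

print(windowed_demo.tolist())  # [[0, 0, 127], [255, 255, 63]]
print(float(normalized_demo.min()), float(normalized_demo.max()))  # 0.0 1.0
```

Values outside the window saturate at 0 or 255, which is why windowing improves contrast inside the range of interest at the cost of detail outside it.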

In [None]:
# Visualize the preprocessing steps
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# Original
axes[0].imshow(synthetic_image, cmap='gray')
axes[0].set_title(f'Original\n{synthetic_image.shape}')
axes[0].axis('off')

# Windowed
axes[1].imshow(windowed, cmap='gray')
axes[1].set_title('Windowed\n[0-255]')
axes[1].axis('off')

# Resized
axes[2].imshow(resized, cmap='gray')
axes[2].set_title(f'Resized\n{resized.shape}')
axes[2].axis('off')

# Normalized
axes[3].imshow(normalized, cmap='gray')
axes[3].set_title('Normalized\n[0-1]')
axes[3].axis('off')

plt.suptitle('Image Preprocessing Pipeline', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### Dataset Splitting

For machine learning, we need to split data into train/validation/test sets:

In [None]:
# Create sample dataset
n_samples = 100
sample_paths = [f"patient_{i:03d}/image.dcm" for i in range(n_samples)]
sample_labels = [0] * 60 + [1] * 40  # 60% class 0, 40% class 1

# Split the dataset
splits = preprocessor.create_dataset_split(
    sample_paths, 
    sample_labels,
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15,
    stratify=True  # Maintain class proportions
)

# Display results
print("Dataset Split Results:")
print("=" * 50)

split_data = []
for split_name, data in splits.items():
    labels = data['labels']
    total = len(labels)
    class_0 = labels.count(0)
    class_1 = labels.count(1)
    split_data.append({
        'Split': split_name.capitalize(),
        'Total': total,
        'Class 0': class_0,
        'Class 1': class_1,
        'Class 0 %': f"{class_0/total*100:.1f}%",
        'Class 1 %': f"{class_1/total*100:.1f}%"
    })

split_df = pd.DataFrame(split_data)
print(split_df.to_string(index=False))

print("\n💡 Note: Stratification ensures each split has similar class proportions!")
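
Stratification can be sketched with the standard library alone: group samples by label, shuffle within each group, and cut each group using the same ratios. This is an illustrative version, not the implementation behind `create_dataset_split`; `stratified_split` and its return layout are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(paths, labels, train_ratio=0.7, val_ratio=0.15, seed=42):
    """Split (path, label) pairs so each split keeps the class proportions."""
    by_class = defaultdict(list)
    for path, label in zip(paths, labels):
        by_class[label].append(path)

    rng = random.Random(seed)  # local generator keeps the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for label, items in sorted(by_class.items()):
        rng.shuffle(items)
        n_train = round(len(items) * train_ratio)
        n_val = round(len(items) * val_ratio)
        splits["train"] += [(p, label) for p in items[:n_train]]
        splits["val"] += [(p, label) for p in items[n_train:n_train + n_val]]
        splits["test"] += [(p, label) for p in items[n_train + n_val:]]  # remainder
    return splits


demo_paths = [f"patient_{i:03d}/image.dcm" for i in range(100)]
demo_labels = [0] * 60 + [1] * 40
demo_splits = stratified_split(demo_paths, demo_labels)

for name, pairs in demo_splits.items():
    class_0 = sum(1 for _, label in pairs if label == 0)
    print(name, len(pairs), class_0, len(pairs) - class_0)
```

With 60 and 40 samples per class, each split keeps the 60/40 ratio: 42 and 28 for training, 9 and 6 each for validation and test.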

---
## 4. 🧪 Blood Test Data Processing

Clinical lab data provides important context for medical imaging analysis. Let's use **pandas** to process blood test data.

In [None]:
from ingestion import BloodTestLoader
from ingestion.blood_test_loader import REFERENCE_RANGES

# Initialize the loader
lab_loader = BloodTestLoader(
    normalize_units=True,
    add_reference_ranges=True,
    validate_values=True
)

print("BloodTestLoader Configuration:")
print(f"  Normalize units: {lab_loader.normalize_units}")
print(f"  Add reference ranges: {lab_loader.add_reference_ranges}")
print(f"  Validate values: {lab_loader.validate_values}")

In [None]:
# Display reference ranges for common tests
print("Reference Ranges for Common Blood Tests:")
print("=" * 55)

ref_data = []
for test_name, values in list(REFERENCE_RANGES.items())[:10]:
    ref_data.append({
        'Test': test_name,
        'Min': values['min'],
        'Max': values['max'],
        'Unit': values['unit']
    })

ref_df = pd.DataFrame(ref_data)
print(ref_df.to_string(index=False))

In [None]:
# Create sample blood test data (simulating hospital lab results)
np.random.seed(42)

# Generate realistic lab data for multiple patients
patients = ['P001', 'P002', 'P003', 'P004', 'P005']
tests = ['WBC', 'Hemoglobin', 'Glucose', 'Creatinine', 'CRP']

lab_records = []
for patient in patients:
    for test in tests:
        ref = REFERENCE_RANGES[test]
        # Generate values (some normal, some abnormal)
        if np.random.random() < 0.7:  # 70% normal
            value = np.random.uniform(ref['min'], ref['max'])
        else:  # 30% abnormal
            if np.random.random() < 0.5:
                value = ref['min'] * np.random.uniform(0.5, 0.9)  # Low
            else:
                value = ref['max'] * np.random.uniform(1.1, 1.5)  # High
        
        lab_records.append({
            'patient_id': patient,
            'lab_name': test,
            'value': round(value, 2),
            'unit': ref['unit'],
            'test_datetime': pd.Timestamp('2024-01-15') + pd.Timedelta(days=np.random.randint(0, 30))
        })

# Create DataFrame
lab_df = pd.DataFrame(lab_records)
print(f"Created sample lab data: {len(lab_df)} records")
print("\nFirst 10 records:")
lab_df.head(10)

In [None]:
# Process the lab data
processed_labs = lab_loader.load_dataframe(lab_df)

print("Processed Lab Data:")
print("=" * 70)

# Show with validation results
display_cols = ['patient_id', 'lab_name', 'value', 'ref_min', 'ref_max', 'abnormal_flag']
processed_labs[display_cols].head(15)
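
The `abnormal_flag` column comes from comparing each value with its reference range. A minimal sketch of that comparison is below; the flag labels, the function name `flag_value`, and the demo range are assumptions (the numbers are arbitrary placeholders, not clinical reference values), and `BloodTestLoader` may use different labels or boundary rules.

```python
def flag_value(value: float, ref_min: float, ref_max: float) -> str:
    """Classify a lab value relative to an inclusive reference range."""
    if value < ref_min:
        return "low"
    if value > ref_max:
        return "high"
    return "normal"


# Placeholder range in the same min/max/unit shape as REFERENCE_RANGES entries
demo_range = {"min": 4.0, "max": 11.0, "unit": "arbitrary units"}

demo_flags = [
    flag_value(v, demo_range["min"], demo_range["max"])
    for v in (3.2, 4.0, 7.5, 11.0, 12.8)
]
print(demo_flags)  # ['low', 'normal', 'normal', 'normal', 'high']
```

Treating the boundaries as normal is a design choice; whichever rule is used should be applied consistently across all tests.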

In [None]:
# Analyze the results
print("Lab Value Analysis:")
print("=" * 50)

# Count by abnormal flag
flag_counts = processed_labs['abnormal_flag'].value_counts()
print("\nValue Distribution:")
for flag, count in flag_counts.items():
    pct = count / len(processed_labs) * 100
    bar = '█' * int(pct / 2)
    print(f"  {flag:8} : {count:3} ({pct:5.1f}%) {bar}")

# Patients with abnormal values
print("\nPatients with Abnormal Values:")
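abnormal = processed_labs[processed_labs['is_abnormal'] == True]
patient_abnormal = abnormal.groupby('patient_id')['lab_name'].apply(list)
# Use distinct loop names so the earlier `patient` and `tests` variables are not overwritten
for patient_id, abnormal_tests in patient_abnormal.items():
    print(f"  {patient_id}: {', '.join(abnormal_tests)}")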

In [None]:
# Visualize lab results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Value distribution by test
sns.boxplot(data=processed_labs, x='lab_name', y='value', ax=axes[0])
axes[0].set_title('Lab Value Distribution by Test', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Test Name')
axes[0].set_ylabel('Value')
axes[0].tick_params(axis='x', rotation=45)

# Plot 2: Abnormal flag distribution
flag_counts.plot(kind='pie', autopct='%1.1f%%', ax=axes[1], 
                 colors=['#2ecc71', '#e74c3c', '#f39c12'])
axes[1].set_title('Abnormal Value Distribution', fontsize=12, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

---
## 5. 📊 Visualization

The visualization module creates charts and reports for analysis results.

In [None]:
from visualization import ResultsVisualizer

# Initialize visualizer
visualizer = ResultsVisualizer(
    figure_size=(12, 8),
    style='seaborn-v0_8-darkgrid',
    color_palette='Set2'
)

print("ResultsVisualizer initialized!")
print(f"  Figure size: {visualizer.figure_size}")

In [None]:
# Create sample prediction data (simulating model outputs)
np.random.seed(42)

sample_predictions = []
classes = ['Normal', 'Pneumonia', 'CHF']
class_weights = [0.5, 0.35, 0.15]  # Class distribution

for i in range(50):
    pred_class = np.random.choice(classes, p=class_weights)
    # Higher confidence for Normal, lower for others
    if pred_class == 'Normal':
        confidence = np.random.uniform(0.75, 0.98)
    else:
        confidence = np.random.uniform(0.55, 0.90)
    
    sample_predictions.append({
        'prediction': pred_class,
        'confidence': confidence,
        'study_date': (pd.Timestamp('2024-01-01') + pd.Timedelta(days=i*3)).strftime('%Y-%m-%d')
    })

print(f"Created {len(sample_predictions)} sample predictions")
pd.DataFrame(sample_predictions).head()

In [None]:
# Plot prediction distribution
fig = visualizer.plot_prediction_distribution(
    sample_predictions,
    show_plot=True
)

In [None]:
# Plot confidence distribution by class
fig = visualizer.plot_confidence_distribution(
    sample_predictions,
    by_class=True,
    show_plot=True
)
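
The two plots above summarize how many predictions fall in each class and how confident the model is per class. The same numbers can be computed directly, which is a useful sanity check on a chart. The sketch below uses only the standard library; `summarize_predictions` is a hypothetical helper and the three records are synthetic.

```python
from collections import defaultdict
from statistics import mean


def summarize_predictions(predictions):
    """Return count and mean confidence for each predicted class."""
    grouped = defaultdict(list)
    for record in predictions:
        grouped[record["prediction"]].append(record["confidence"])
    return {
        label: {"count": len(values), "mean_confidence": mean(values)}
        for label, values in grouped.items()
    }


# Synthetic records in the same shape as `sample_predictions`
demo_predictions = [
    {"prediction": "Normal", "confidence": 0.9},
    {"prediction": "Normal", "confidence": 0.8},
    {"prediction": "Pneumonia", "confidence": 0.6},
]
prediction_summary = summarize_predictions(demo_predictions)

for label, stats in prediction_summary.items():
    print(f"{label}: n={stats['count']}, mean confidence={stats['mean_confidence']:.2f}")
```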

In [None]:
# Create a sample patient report
patient_report = {
    'patient_id': 'ANON_A1B2C3D4E5F6',
    'predictions': sample_predictions[:5],
    'summary': {
        'num_images': 5,
        'num_predictions': 5,
        'num_lab_tests': 10,
        'num_correlations': 3
    },
    'correlations': []
}

# Plot patient timeline
fig = visualizer.plot_patient_timeline(
    patient_report,
    show_plot=True
)

---
## Summary

This notebook demonstrated the core functionality of the Medical Imaging DICOM Processing Pipeline:

| Module | Purpose | Key Libraries |
|--------|---------|---------------|
| **Ingestion** | Load DICOM files, validate format | pydicom |
| **Anonymization** | Remove PHI, hash patient IDs | hashlib |
| **Preprocessing** | Resize, normalize, augment images | numpy, PIL |
| **Blood Tests** | Load and validate lab data | pandas |
| **Visualization** | Create charts and reports | matplotlib, seaborn |

### Python Concepts Demonstrated

- ✅ Object-Oriented Programming (classes, methods, encapsulation)
- ✅ NumPy array operations
- ✅ Pandas DataFrames for data manipulation
- ✅ Type hints and docstrings
- ✅ File I/O and path handling
- ✅ Matplotlib/Seaborn visualization
- ✅ Configuration management
- ✅ Logging

---
*Medical Imaging DICOM Processing Pipeline - Course Submission*