# Patient Data Processing Tutorial

This notebook demonstrates how to use the patient modules of the `medicalproject2024` package for patient outcome prediction, trial patient simulation, and trial patient matching.

## Overview

The patient module provides three main functionalities:
1. **Patient Prediction**: AI-powered patient outcome prediction
2. **Trial Patient Simulation**: Simulate patient data for clinical trials
3. **Trial Patient Matching**: Match patients to appropriate clinical trials

## Table of Contents
1. [Setup and Environment Check](#setup)
2. [Patient Prediction](#patient-prediction)
3. [Trial Patient Simulation](#trial-simulation)
4. [Trial Patient Matching](#patient-matching)
5. [Demo Data and Utilities](#demo-utilities)

## 1. Setup and Environment Check {#setup}

In [2]:
# Check if the patient modules exist and are accessible
import os
import sys
import warnings
from pathlib import Path
import pandas as pd
import numpy as np

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Get the current notebook directory
notebook_dir = Path.cwd()
project_root = notebook_dir.parent.parent  # Assumes the notebook runs from <project_root>/medicalproject2024/tutorial/
patient_path = project_root / "medicalproject2024" / "preprocess" / "patient"

print(f"Notebook directory: {notebook_dir}")
print(f"Project root: {project_root}")
print(f"Patient path: {patient_path}")
print(f"Patient path exists: {patient_path.exists()}")

if patient_path.exists():
    print("Patient modules found!")
    
    # Check available files
    files = {
        "__init__.py": patient_path / "__init__.py",
        "bert.py": patient_path / "bert.py",
        "demo.py": patient_path / "demo.py",
        "demo_data.py": patient_path / "demo_data.py",
        "patient_data.py": patient_path / "patient_data.py",
        "tabular_utils.py": patient_path / "tabular_utils.py",
        "vocab_data.py": patient_path / "vocab_data.py",
        "demo_data/": patient_path / "demo_data",
        "trial_patient_match/": patient_path / "trial_patient_match",
        "trial_simulation/": patient_path / "trial_simulation"
    }
    
    print("\nAvailable files and directories:")
    for file_name, file_path in files.items():
        if file_path.exists():
            if file_path.is_dir():
                file_count = len(list(file_path.rglob("*")))
                print(f"  {file_name:20} (directory with {file_count} items)")
            else:
                file_size = file_path.stat().st_size
                print(f"  {file_name:20} ({file_size:,} bytes)")
        else:
            print(f"  {file_name:20} (missing)")
        
else:
    print("Patient modules not found!")

# Add project root to Python path
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
    print(f"\nPython path updated with: {project_root}")

Notebook directory: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health/medicalproject2024/tutorial
Project root: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health
Patient path: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health/medicalproject2024/preprocess/patient
Patient path exists: True
Patient modules found!

Available files and directories:
  __init__.py          (86 bytes)
  bert.py              (5,257 bytes)
  demo.py              (3,482 bytes)
  demo_data.py         (3,396 bytes)
  patient_data.py      (20,623 bytes)
  tabular_utils.py     (16,479 bytes)
  vocab_data.py        (1,324 bytes)
  demo_data/           (directory with 21 items)
  trial_patient_match/ (directory with 4 items)
  trial_simulation/    (directory with 4 items)

Python path updated with: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health
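
The cell above derives `project_root` with a fixed `parent.parent`, which only holds when the notebook is launched from `medicalproject2024/tutorial/`. A slightly more defensive variant, sketched here, walks up from a starting directory until it finds a folder that contains the package directory. The helper name `find_project_root` and its `marker` parameter are hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "medicalproject2024") -> Optional[Path]:
    """Return `start` or its nearest ancestor that contains a `marker` directory."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # no ancestor contains the marker directory


# Resolve the root from wherever the notebook was launched.
root = find_project_root(Path.cwd())
print(f"Project root: {root}" if root else "Package directory not found in any parent")
```

If the helper returns `None`, the notebook can fall back to the fixed `parent.parent` assumption or stop with a clear message.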


In [None]:
def test_patient_imports():
    """Test importing patient modules"""
    print("Testing Patient Module Imports:")
    print("-" * 40)
    
    modules_to_test = [
        ("medicalproject2024.preprocess.patient", "patient_prediction", "Patient Prediction"),
        ("medicalproject2024.preprocess.patient", "trial_patient_simulation", "Trial Simulation"),
        ("medicalproject2024.preprocess.patient", "trial_patient_matching", "Patient Matching"),
    ]
    
    import_results = {}
    imported_functions = {}
    
    for module_path, function_name, display_name in modules_to_test:
        try:
            module = __import__(module_path, fromlist=[function_name])
            func = getattr(module, function_name)
            import_results[display_name] = True
            imported_functions[function_name] = func
            print(f"{display_name:20}: Successfully imported {function_name}")
        except ImportError as e:
            import_results[display_name] = False
            print(f"{display_name:20}: Import failed - {str(e)[:50]}...")
        except AttributeError as e:
            import_results[display_name] = False
            print(f"{display_name:20}: Function not found - {str(e)[:50]}...")
        except Exception as e:
            import_results[display_name] = False
            print(f"{display_name:20}: Warning - {str(e)[:50]}...")
    
    return import_results, imported_functions

# Run import tests
import_results, patient_functions = test_patient_imports()

Testing Patient Module Imports:
----------------------------------------
Patient Prediction  : Successfully imported patient_prediction
Trial Simulation    : Successfully imported trial_patient_simulation
Patient Matching    : Successfully imported trial_patient_matching
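
The import check above relies on `__import__` with a `fromlist`. The standard library's `importlib.import_module` expresses the same lookup more directly. The sketch below shows the pattern on a standard-library module so that it runs anywhere; the helper name `resolve_attribute` is hypothetical.

```python
import importlib
from typing import Any, Optional, Tuple


def resolve_attribute(module_path: str, attr_name: str) -> Tuple[Optional[Any], Optional[str]]:
    """Import `module_path` and fetch `attr_name`; return (object, error message)."""
    try:
        module = importlib.import_module(module_path)
        return getattr(module, attr_name), None
    except ImportError as exc:
        return None, f"import failed: {exc}"
    except AttributeError as exc:
        return None, f"attribute not found: {exc}"


# Demonstrated with the standard library. In this notebook the module path would be
# "medicalproject2024.preprocess.patient" and the attribute, e.g., "patient_prediction".
func, error = resolve_attribute("json", "dumps")
print(func([1, 2]) if error is None else error)
```

Returning the error text instead of printing it keeps the helper reusable in both the import test and the later runner cells.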


## 2. Patient Prediction {#patient-prediction}

Run patient prediction exactly as in `test_patient.py`, which calls `patient_prediction()` with no arguments.

In [4]:
def run_patient_prediction():
    """Run patient prediction as in test_patient.py"""
    print("Running Patient Prediction:")
    print("=" * 40)
    
    if "patient_prediction" not in patient_functions:
        print("patient_prediction function not available")
        return None
    
    try:
        print("Executing: patient_prediction()")
        
        # This is the exact line from test_patient.py
        a = patient_functions["patient_prediction"]()
        
        print("Patient prediction completed!")
        print(f"   Result type: {type(a)}")
        
        if hasattr(a, 'shape'):
            print(f"   Result shape: {a.shape}")
        elif isinstance(a, (list, tuple)):
            print(f"   Result length: {len(a)}")
        elif isinstance(a, dict):
            print(f"   Result keys: {list(a.keys())}")
        elif a is not None:
            print(f"   Result: {str(a)[:100]}..." if len(str(a)) > 100 else f"   Result: {a}")
        
        return a
        
    except Exception as e:
        print(f"Patient prediction failed: {e}")
        print("   This might be due to missing data files or model dependencies")
        return None

# Run patient prediction
if import_results.get("Patient Prediction", False):
    prediction_result = run_patient_prediction()
else:
    print("Skipping patient prediction due to import failure")
    prediction_result = None

Running Patient Prediction:
Executing: patient_prediction()
   subject_id       age    gender  ethnicity  mortality
0    0.000008   0.00000  0.280221   0.886755   0.311455
1    0.000084  76.52293  0.280221   0.325827   0.811455
2    0.000103   0.00000  0.280221   0.886755   0.311455
3    0.000158   0.00000  0.780221   0.325827   0.311455
4    0.000204   0.00000  0.280221   0.325827   0.311455
   subject_id       age    gender  ethnicity  mortality
0    0.000008   0.00000  0.448863   0.028028   0.496538
1    0.000084  76.52293  0.227268   0.261858   0.624924
2    0.000103   0.00000  0.393350   0.005155   0.231926
3    0.000158   0.00000  0.990098   0.245922   0.225179
4    0.000204   0.00000  0.323575   0.085772   0.331818
   subject_id       age gender ethnicity  mortality
0           2   0.00000      M     ASIAN      False
1           3  76.52293      M     WHITE       True
2           5   0.00000      M     ASIAN      False
3           7   0.00000      F     WHITE      False
4       

## 3. Trial Patient Simulation {#trial-simulation}

Run trial patient simulation exactly as in `test_patient.py`, which calls `trial_patient_simulation()` with no arguments.

In [5]:
# Run trial patient simulation (exactly as in test_patient.py: b = trial_patient_simulation())
def run_trial_patient_simulation():
    """Run trial patient simulation as in test_patient.py"""
    print("Running Trial Patient Simulation:")
    print("=" * 40)
    
    if "trial_patient_simulation" not in patient_functions:
        print("trial_patient_simulation function not available")
        return None
    
    try:
        print("Executing: trial_patient_simulation()")
        
        # This is the exact line from test_patient.py
        b = patient_functions["trial_patient_simulation"]()
        
        print("Trial patient simulation completed!")
        print(f"   Result type: {type(b)}")
        
        # Check for a DataFrame first: DataFrames also expose .shape, so the
        # generic shape branch would otherwise make this branch unreachable.
        if isinstance(b, pd.DataFrame):
            print(f"   DataFrame shape: {b.shape}")
            print(f"   Columns: {list(b.columns)[:5]}{'...' if len(b.columns) > 5 else ''}")
        elif hasattr(b, 'shape'):
            print(f"   Result shape: {b.shape}")
        elif isinstance(b, (list, tuple)):
            print(f"   Result length: {len(b)}")
        elif isinstance(b, dict):
            print(f"   Result keys: {list(b.keys())}")
        elif b is not None:
            print(f"   Result: {str(b)[:100]}..." if len(str(b)) > 100 else f"   Result: {b}")
        
        return b
        
    except Exception as e:
        print(f"Trial patient simulation failed: {e}")
        print("   This might be due to missing simulation parameters or data")
        return None

# Run trial patient simulation
if import_results.get("Trial Simulation", False):
    simulation_result = run_trial_patient_simulation()
else:
    print("Skipping trial simulation due to import failure")
    simulation_result = None

Running Trial Patient Simulation:
Executing: trial_patient_simulation()
       race  post-menopause  \
0  0.171795        0.345142   
1  0.311100        0.950714   
2  0.240955        0.470512   
3  0.454416        0.015893   
4  0.258307        0.772865   

   human epidermal growth factor receptor 2 is positive  treatment  \
0                                           0.345667      0.421827   
1                                           0.212333      0.755880   
2                                           0.418049      0.955986   
3                                           0.002395      0.465470   
4                                           0.127597      0.575113   

   tumor laterality  estrogen receptor positive  \
0          0.006412                    0.641281   
1          0.176768                    0.811455   
2          0.432446                    0.093315   
3          0.067200                    0.017711   
4          0.512800                    0.894053   

   progestero

## 4. Trial Patient Matching {#patient-matching}

Run trial patient matching exactly as in `test_patient.py`, which calls `trial_patient_matching()` with no arguments.

In [6]:
# Run trial patient matching (exactly as in test_patient.py: c = trial_patient_matching())
def run_trial_patient_matching():
    """Run trial patient matching as in test_patient.py"""
    print("Running Trial Patient Matching:")
    print("=" * 40)
    
    if "trial_patient_matching" not in patient_functions:
        print("trial_patient_matching function not available")
        return None
    
    try:
        print("Executing: trial_patient_matching()")
        
        # This is the exact line from test_patient.py
        c = patient_functions["trial_patient_matching"]()
        
        print("Trial patient matching completed!")
        print(f"   Result type: {type(c)}")
        
        # Check for a DataFrame first: DataFrames also expose .shape, so the
        # generic shape branch would otherwise make this branch unreachable.
        if isinstance(c, pd.DataFrame):
            print(f"   DataFrame shape: {c.shape}")
            print(f"   Columns: {list(c.columns)[:5]}{'...' if len(c.columns) > 5 else ''}")
        elif hasattr(c, 'shape'):
            print(f"   Result shape: {c.shape}")
        elif isinstance(c, (list, tuple)):
            print(f"   Result length: {len(c)}")
            if len(c) > 0:
                print(f"   Sample items: {c[:3]}")
        elif isinstance(c, dict):
            print(f"   Result keys: {list(c.keys())}")
            for key, value in list(c.items())[:3]:
                print(f"     {key}: {type(value)}")
        elif c is not None:
            print(f"   Result: {str(c)[:100]}..." if len(str(c)) > 100 else f"   Result: {c}")
        
        return c
        
    except Exception as e:
        print(f"Trial patient matching failed: {e}")
        print("   This might be due to missing patient data or trial criteria")
        return None

# Run trial patient matching
if import_results.get("Patient Matching", False):
    matching_result = run_trial_patient_matching()
else:
    print("Skipping patient matching due to import failure")
    matching_result = None

Running Trial Patient Matching:
Executing: trial_patient_matching()


BERT encoding total samples 180: 100%|██████████| 3/3 [00:01<00:00,  1.52it/s]
BERT encoding total samples 151: 100%|██████████| 3/3 [00:00<00:00, 18.35it/s]


visit: 
 [[[1, 2, 3, 4]], [[2, 1, 5, 3]], [[6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]], [[34, 35, 5, 3, 36, 37, 2, 1, 38, 39]], [[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 20, 55, 56, 14, 57, 58, 59, 60, 61, 62, 63, 64, 65, 6, 11, 66, 10, 8, 67, 68, 69, 29, 27, 25, 70, 71, 72, 73, 19]]]
feature: 
 [[0.00000000e+00 2.98981009e-02 3.64721404e-02]
 [7.65229295e+01 1.00422936e-01 2.43718328e-01]
 [0.00000000e+00 3.37891136e-01 3.57477891e-03]
 [0.00000000e+00 9.87152994e-01 1.58522936e-01]
 [0.00000000e+00 4.63243481e-03 4.69673509e-01]]
label: 
 [[[1, 26, 27, 28, 29, 30], [1, 50, 51, 52]], [[1, 26, 27, 28, 29, 30], [1, 50, 51, 52]], [[1, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179], [128, 1, 129, 131, 130, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150]], [[32, 1, 31], [1, 53, 54, 55]], [[1, 13, 14, 15, 16], [1
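
The printed `visit` value looks like a three-level nested list. Assuming the levels are patient, then visit, then code index (an interpretation of the output above, not something the notebook states), a per-patient summary of visits and codes can be sketched as follows. The helper `summarize_visits` is hypothetical, and the sample data is copied from the first, second, and fourth entries of the `visit` output.

```python
from typing import Dict, List


def summarize_visits(visits: List[List[List[int]]]) -> List[Dict[str, int]]:
    """Count visits, codes, and distinct codes for each patient."""
    summary = []
    for patient_visits in visits:
        # Flatten all visits of one patient into a single list of code indices.
        codes = [code for visit in patient_visits for code in visit]
        summary.append({
            "n_visits": len(patient_visits),
            "n_codes": len(codes),
            "n_unique_codes": len(set(codes)),
        })
    return summary


# Assumed layout: sample[patient][visit] is a list of code indices.
sample = [
    [[1, 2, 3, 4]],
    [[2, 1, 5, 3]],
    [[34, 35, 5, 3, 36, 37, 2, 1, 38, 39]],
]
for index, row in enumerate(summarize_visits(sample)):
    print(index, row)
```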

## 5. Demo Data and Utilities {#demo-utilities}

Explore the demo data and utility functions available in the patient module.

In [7]:
# Explore demo data directory
def explore_demo_data():
    """Explore the demo data available in the patient module"""
    print("Exploring Demo Data:")
    print("=" * 30)
    
    demo_data_path = patient_path / "demo_data"
    
    if not demo_data_path.exists():
        print("Demo data directory not found")
        return
    
    print(f"Demo data directory: {demo_data_path}")
    
    # Explore subdirectories
    subdirs = {
        "patient/sequence/": "Patient sequence data",
        "patient/tabular/": "Patient tabular data",
        "trial_outcome_data/": "Trial outcome data",
        "trial_patient_data/": "Trial patient data"
    }
    
    for subdir, description in subdirs.items():
        subdir_path = demo_data_path / subdir
        if subdir_path.exists():
            file_count = len(list(subdir_path.rglob("*")))
            print(f"  {subdir:25}: {description} ({file_count} files)")
            
            # Show up to three sample files (directories are filtered out before slicing)
            sample_files = [p for p in subdir_path.rglob("*") if p.is_file()][:3]
            for file_path in sample_files:
                file_size = file_path.stat().st_size
                print(f"    📄 {file_path.name} ({file_size:,} bytes)")
        else:
            print(f"  {subdir:25}: {description} (not found)")

# Explore demo data
explore_demo_data()

Exploring Demo Data:
Demo data directory: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health/medicalproject2024/preprocess/patient/demo_data
  patient/sequence/        : Patient sequence data (4 files)
    📄 preprocess.py (2,918 bytes)
    📄 voc.json (87,146 bytes)
    📄 patient_tabular.csv (866,127 bytes)
  patient/tabular/         : Patient tabular data (2 files)
    📄 preprocess.py (1,415 bytes)
    📄 patient_tabular.csv (866,127 bytes)
  trial_outcome_data/      : Trial outcome data (9 files)
    📄 phase_III_train.csv (8,294,175 bytes)
    📄 phase_I_train.csv (3,675,648 bytes)
    📄 phase_II_test.csv (7,014,043 bytes)
  trial_patient_data/      : Trial patient data (1 files)
    📄 data_processed.csv (394,636 bytes)
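
`Path.rglob("*")` yields directories as well as files, so the counts labelled "files" above are really item counts. A sketch of a helper that counts regular files only and also totals their size is shown below; `summarize_dir` is a hypothetical name, not a project function, and the example builds a throwaway directory so it runs without the demo data.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def summarize_dir(path: Path) -> Tuple[int, int]:
    """Return (number of regular files, total size in bytes) under `path`, recursively."""
    files = [p for p in path.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


# Build a small temporary tree: one subdirectory and two files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "sub" / "a.csv").write_bytes(b"x,y\n1,2\n")
    (root / "b.json").write_bytes(b"{}")
    item_count = len(list(root.rglob("*")))  # includes the "sub" directory itself
    file_count, total_bytes = summarize_dir(root)

print(f"{item_count} items, {file_count} files, {total_bytes:,} bytes")
```

In the notebook, the same helper could be applied to `patient_path / "demo_data"` or any of its subdirectories.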


In [8]:
# Test demo functionality
def test_demo_functions():
    """Test demo functions if available"""
    print("\nTesting Demo Functions:")
    print("-" * 30)
    
    try:
        # Try to import demo modules
        from medicalproject2024.preprocess.patient import demo_data, demo
        print("Successfully imported demo modules")
        
        # List available functions
        demo_items = [name for name in dir(demo) if not name.startswith('_')]
        demo_data_items = [name for name in dir(demo_data) if not name.startswith('_')]
        
        if demo_items:
            print(f"   Demo functions: {demo_items[:5]}{'...' if len(demo_items) > 5 else ''}")
        
        if demo_data_items:
            print(f"   Demo data functions: {demo_data_items[:5]}{'...' if len(demo_data_items) > 5 else ''}")
        
    except ImportError as e:
        print(f"Could not import demo modules: {e}")
    except Exception as e:
        print(f"Error testing demo functions: {e}")

# Test additional modules
def test_additional_modules():
    """Test additional patient modules"""
    print("\nTesting Additional Modules:")
    print("-" * 30)
    
    additional_modules = [
        ("bert", "BERT model utilities"),
        ("patient_data", "Patient data processing"),
        ("tabular_utils", "Tabular data utilities"),
        ("vocab_data", "Vocabulary data processing")
    ]
    
    for module_name, description in additional_modules:
        try:
            module = __import__(f"medicalproject2024.preprocess.patient.{module_name}", fromlist=[''])
            items = [name for name in dir(module) if not name.startswith('_')]
            print(f"{module_name:15}: {description}")
            if items:
                print(f"   Available items: {items[:3]}{'...' if len(items) > 3 else ''}")
        except ImportError as e:
            print(f"{module_name:15}: Import failed - {str(e)[:40]}...")
        except Exception as e:
            print(f"{module_name:15}: Warning - {str(e)[:40]}...")

# Run additional tests
test_demo_functions()
test_additional_modules()


Testing Demo Functions:
------------------------------
Successfully imported demo modules
   Demo functions: ['PatientData', 'TabularPatient', 'TabularPatientBase', 'TrialData', 'load_mimic_ehr_sequence']...
   Demo data functions: ['TabularPatientBase', 'dill', 'json', 'load_mimic_ehr_sequence', 'load_table_config']...

Testing Additional Modules:
------------------------------
bert           : BERT model utilities
   Available items: ['AutoModel', 'AutoTokenizer', 'BERT']...
patient_data   : Patient data processing
   Available items: ['Dataset', 'HyperTransformer', 'OrderedDict']...
tabular_utils  : Tabular data utilities
   Available items: ['BaseTransformer', 'BinaryEncoder', 'Config']...
vocab_data     : Vocabulary data processing
   Available items: ['Vocab']
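
The listings above come from `dir(module)`, which also reports names a module merely imported (for example `json`, `dill`, `AutoModel`, or `OrderedDict`). To see only the classes and functions a module defines itself, one option is to compare each object's `__module__` attribute with the module's name. The sketch below uses the hypothetical helper `own_public_names` and demonstrates it on the standard-library `json` package.

```python
import json
from types import ModuleType
from typing import List


def own_public_names(module: ModuleType) -> List[str]:
    """List public objects whose __module__ attribute matches `module` itself."""
    names = []
    for name, obj in vars(module).items():
        if name.startswith("_"):
            continue
        # Imported classes and functions report the module that defined them;
        # submodules and plain constants have no __module__ and are skipped too.
        if getattr(obj, "__module__", None) == module.__name__:
            names.append(name)
    return sorted(names)


names = own_public_names(json)
print(names)
```

The same call on `medicalproject2024.preprocess.patient.bert` or `patient_data` would hide re-exported third-party names, at the cost of also hiding module-level constants.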


## Summary

This notebook demonstrates the patient data processing workflow following the exact structure from `test_patient.py`:

### Key Functions (from test_patient.py):
1. **`a = patient_prediction()`**: AI-powered patient outcome prediction
2. **`b = trial_patient_simulation()`**: Simulate patient data for clinical trials
3. **`c = trial_patient_matching()`**: Match patients to appropriate clinical trials

### Module Structure:
- **bert.py**: BERT model utilities for NLP tasks
- **demo.py**: Demo functions and examples
- **demo_data.py**: Demo data loading utilities
- **patient_data.py**: Core patient data processing
- **tabular_utils.py**: Utilities for tabular data processing
- **vocab_data.py**: Vocabulary and text data processing

### Data Directories:
- **demo_data/patient/**: Patient sequence and tabular data
- **demo_data/trial_outcome_data/**: Clinical trial outcome datasets
- **demo_data/trial_patient_data/**: Trial patient information
- **trial_patient_match/**: Patient-trial matching algorithms
- **trial_simulation/**: Trial simulation utilities

### Usage Notes:
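- All three main functions are called without arguments in this notebook
- In this run, `patient_prediction()` and `trial_patient_simulation()` printed tabular data, while `trial_patient_matching()` printed `visit`, `feature`, and `label` structures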
- Demo data is available for testing and development
- Additional utilities support BERT-based NLP and tabular data processing
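
Because the return types are not documented here, the three runner functions in sections 2 to 4 each inspect the result with nearly identical branches. A shared helper, sketched below under the hypothetical name `describe_result`, could replace those branches. It uses duck typing for the table case (an object with both `shape` and `columns`) so the sketch runs without pandas installed.

```python
from typing import Any


def describe_result(result: Any, max_chars: int = 100) -> str:
    """Return a one-line description of a result of unknown type."""
    if result is None:
        return "no result"
    # Table-like objects (e.g. a pandas DataFrame) are checked before plain arrays,
    # because both expose a .shape attribute.
    if hasattr(result, "shape") and hasattr(result, "columns"):
        cols = list(result.columns)
        suffix = "..." if len(cols) > 5 else ""
        return f"table shape={result.shape}, columns={cols[:5]}{suffix}"
    if hasattr(result, "shape"):
        return f"array shape={result.shape}"
    if isinstance(result, dict):
        return f"dict keys={list(result.keys())}"
    if isinstance(result, (list, tuple)):
        return f"{type(result).__name__} length={len(result)}"
    text = str(result)
    return text if len(text) <= max_chars else text[:max_chars] + "..."


print(describe_result({"a": 1}))
print(describe_result([1, 2, 3]))
```

Each runner could then print `describe_result(a)` (or `b`, `c`) after the call instead of repeating the `if`/`elif` chain.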

### Next Steps:
- Use your own patient data with the prediction pipeline
- Customize trial simulation parameters
- Implement custom matching criteria for specific trials
- Explore the BERT-based patient text analysis capabilities