In [1]:
import pandas as pd

init_data = pd.read_csv('/Users/benjamindykstra/development/icd-10-coding/data/MIMIC_IV_Trasncript.csv')



In [2]:
init_data.shape

(232158, 31)

In [12]:
init_data['subject_id'].unique()

array([10000032, 10002495, 10002930, 10003400, 10004235, 10004720,
       10004733])

In [4]:
init_data.head(10)

Unnamed: 0,subject_id,hadm_id,admission_type,admission_location,discharge_location,insurance,marital_status,race,gender,anchor_age,...,transaction_type,spec_type_desc,test_name,org_name,ab_name,comments,drg_type,description,drg_severity,drg_mortality
0,10000032,22595853,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,WIDOWED,BLACK/CAPE VERDEAN,F,52,...,New,SWAB,R/O VANCOMYCIN RESISTANT ENTEROCOCCUS,,,No VRE isolated.,APR,OTHER DISORDERS OF THE LIVER,2.0,2.0
1,10000032,22595853,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,WIDOWED,BLACK/CAPE VERDEAN,F,52,...,New,SWAB,R/O VANCOMYCIN RESISTANT ENTEROCOCCUS,,,No VRE isolated.,HCFA,"DISORDERS OF LIVER EXCEPT MALIG,CIRR,ALC HEPA ...",,
2,10000032,22595853,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,WIDOWED,BLACK/CAPE VERDEAN,F,52,...,New,PERITONEAL FLUID,ANAEROBIC CULTURE,,,NEGATIVE BY EIA. (Reference Range-N...,APR,OTHER DISORDERS OF THE LIVER,2.0,2.0
3,10000032,22595853,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,WIDOWED,BLACK/CAPE VERDEAN,F,52,...,New,PERITONEAL FLUID,ANAEROBIC CULTURE,,,NEGATIVE BY EIA. (Reference Range-N...,HCFA,"DISORDERS OF LIVER EXCEPT MALIG,CIRR,ALC HEPA ...",,
4,10000032,22595853,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,WIDOWED,BLACK/CAPE VERDEAN,F,52,...,New,PERITONEAL FLUID,FLUID CULTURE,,,POSITIVE BY EIA. (Reference Range-N...,APR,OTHER DISORDERS OF THE LIVER,2.0,2.0
5,10000032,22595853,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,WIDOWED,BLACK/CAPE VERDEAN,F,52,...,New,PERITONEAL FLUID,FLUID CULTURE,,,POSITIVE BY EIA. (Reference Range-N...,HCFA,"DISORDERS OF LIVER EXCEPT MALIG,CIRR,ALC HEPA ...",,
6,10000032,22595853,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,WIDOWED,BLACK/CAPE VERDEAN,F,52,...,New,PERITONEAL FLUID,GRAM STAIN,,,NO CYCLOSPORA SEEN.,APR,OTHER DISORDERS OF THE LIVER,2.0,2.0
7,10000032,22595853,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,WIDOWED,BLACK/CAPE VERDEAN,F,52,...,New,PERITONEAL FLUID,GRAM STAIN,,,NO CYCLOSPORA SEEN.,HCFA,"DISORDERS OF LIVER EXCEPT MALIG,CIRR,ALC HEPA ...",,
8,10000032,22595853,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,WIDOWED,BLACK/CAPE VERDEAN,F,52,...,New,URINE,URINE CULTURE,,,___,APR,OTHER DISORDERS OF THE LIVER,2.0,2.0
9,10000032,22595853,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,WIDOWED,BLACK/CAPE VERDEAN,F,52,...,New,URINE,URINE CULTURE,,,___,HCFA,"DISORDERS OF LIVER EXCEPT MALIG,CIRR,ALC HEPA ...",,


# MIMIC-IV Dataset Exploration for Medical NER and ICD-10 Coding

This notebook explores the MIMIC-IV clinical transcript dataset and implements:
1. **Data Exploration**: Understanding the structure, text fields, and target variables
2. **Medical NER**: Applying pre-trained biomedical NER models to extract entities
3. **Feature Engineering**: Creating features from NER outputs
4. **ICD-10 Classification**: Building a model to predict diagnosis codes

## Dataset Overview
- **Size**: 232,158 clinical transcript records
- **Key Features**: Demographics, medications, lab tests, clinical text
- **Target**: DRG descriptions, used here as a coarse proxy for ICD-10 codes (a DRG groups many diagnosis and procedure codes, so the mapping is not one-to-one)


## 1. Initial Data Exploration


In [6]:
# Load data with proper dtype handling
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Load with low_memory=False to avoid mixed type warnings
data = pd.read_csv('/Users/benjamindykstra/development/icd-10-coding/data/MIMIC_IV_Trasncript.csv', low_memory=False)

print(f"Dataset shape: {data.shape}")
print(f"\nColumns ({len(data.columns)}):")
print(data.columns.tolist())


Dataset shape: (232158, 31)

Columns (31):
['subject_id', 'hadm_id', 'admission_type', 'admission_location', 'discharge_location', 'insurance', 'marital_status', 'race', 'gender', 'anchor_age', 'drug', 'formulary_drug_cd', 'prod_strength', 'dose_val_rx', 'dose_unit_rx', 'form_unit_disp', 'route', 'eventtype', 'careunit', 'order_type', 'order_subtype', 'transaction_type', 'spec_type_desc', 'test_name', 'org_name', 'ab_name', 'comments', 'drg_type', 'description', 'drg_severity', 'drg_mortality']


In [7]:
# Check data types and missing values
print("Data Types:")
print(data.dtypes)
print("\n" + "="*80 + "\n")
print("Missing Values (top 15):")
missing = data.isnull().sum().sort_values(ascending=False).head(15)
missing_pct = (missing / len(data) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Percentage': missing_pct})
print(missing_df)


Data Types:
subject_id              int64
hadm_id                 int64
admission_type         object
admission_location     object
discharge_location     object
insurance              object
marital_status         object
race                   object
gender                 object
anchor_age              int64
drug                   object
formulary_drug_cd      object
prod_strength          object
dose_val_rx           float64
dose_unit_rx           object
form_unit_disp         object
route                  object
eventtype              object
careunit               object
order_type             object
order_subtype          object
transaction_type       object
spec_type_desc         object
test_name              object
org_name               object
ab_name                object
comments               object
drg_type               object
description            object
drg_severity          float64
drg_mortality         float64
dtype: object


Missing Values (top 15):
                 

In [8]:
# Explore text fields that are valuable for NER
text_columns = ['comments', 'drug', 'test_name', 'spec_type_desc', 'description', 
                'org_name', 'ab_name', 'prod_strength', 'route']

print("Text Field Analysis:")
print("="*80)
for col in text_columns:
    if col in data.columns:
        non_null = data[col].notna().sum()
        unique = data[col].nunique()
        print(f"\n{col}:")
        print(f"  Non-null values: {non_null:,} ({non_null/len(data)*100:.1f}%)")
        print(f"  Unique values: {unique:,}")
        if non_null > 0:
            sample = data[col].dropna().iloc[0]
            print(f"  Sample: {str(sample)[:100]}...")


Text Field Analysis:

comments:
  Non-null values: 218,052 (93.9%)
  Unique values: 11
  Sample: No VRE isolated.  ...

drug:
  Non-null values: 232,158 (100.0%)
  Unique values: 130
  Sample: Furosemide...

test_name:
  Non-null values: 232,158 (100.0%)
  Unique values: 11
  Sample: R/O VANCOMYCIN RESISTANT ENTEROCOCCUS...

spec_type_desc:
  Non-null values: 232,158 (100.0%)
  Unique values: 9
  Sample: SWAB...

description:
  Non-null values: 232,158 (100.0%)
  Unique values: 19
  Sample: OTHER DISORDERS OF THE LIVER...

org_name:
  Non-null values: 128,513 (55.4%)
  Unique values: 7
  Sample: CLOSTRIDIUM DIFFICILE...

ab_name:
  Non-null values: 114,161 (49.2%)
  Unique values: 17
  Sample: GENTAMICIN...

prod_strength:
  Non-null values: 232,026 (99.9%)
  Unique values: 51
  Sample: 100 Units / mL - 10 mL Vial...

route:
  Non-null values: 232,068 (100.0%)
  Unique values: 9
  Sample: SC...


In [9]:
# Analyze target variable - DRG descriptions (proxy for ICD-10)
print("Target Variable Analysis - DRG Descriptions:")
print("="*80)
print(f"\nTotal unique DRG descriptions: {data['description'].nunique()}")
print(f"\nTop 20 most common diagnoses:")
top_diagnoses = data['description'].value_counts().head(20)
for idx, (diag, count) in enumerate(top_diagnoses.items(), 1):
    print(f"{idx:2d}. {diag[:60]:60s} - {count:6,} ({count/len(data)*100:.2f}%)")


Target Variable Analysis - DRG Descriptions:

Total unique DRG descriptions: 19

Top 20 most common diagnoses:
 1. NONSPECIFIC CEREBROVASCULAR DISORDERS W MCC                  - 84,289 (36.31%)
 2. ALTERATION IN CONSCIOUSNESS                                  - 84,289 (36.31%)
 3. NON-EXTENSIVE O.R. PROC UNRELATED TO PRINCIPAL DIAGNOSIS W M - 24,570 (10.58%)
 4. EXTENSIVE PROCEDURE UNRELATED TO PRINCIPAL DIAGNOSIS         -  7,056 (3.04%)
 5. RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT >96 HOURS  -  7,056 (3.04%)
 6. PERCUTANEOUS CORONARY INTERVENTION W AMI                     -  5,568 (2.40%)
 7. SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC           -  4,225 (1.82%)
 8. SEPTICEMIA & DISSEMINATED INFECTIONS                         -  4,225 (1.82%)
 9. CARDIAC ARRHYTHMIA & CONDUCTION DISORDERS W MCC              -  3,320 (1.43%)
10. CARDIAC ARRHYTHMIA & CONDUCTION DISORDERS                    -  3,320 (1.43%)
11. FEVER                                                        -

In [10]:
# Patient and admission statistics
print("Patient and Admission Statistics:")
print("="*80)
print(f"Unique patients (subject_id): {data['subject_id'].nunique():,}")
print(f"Unique admissions (hadm_id): {data['hadm_id'].nunique():,}")
print(f"Average records per admission: {len(data) / data['hadm_id'].nunique():.1f}")

print("\n\nDemographic Distribution:")
print("-"*40)
print(f"\nGender:")
print(data['gender'].value_counts())
print(f"\nAdmission Type:")
print(data['admission_type'].value_counts())
print(f"\nDRG Type:")
print(data['drg_type'].value_counts())


Patient and Admission Statistics:
Unique patients (subject_id): 7
Unique admissions (hadm_id): 13
Average records per admission: 17858.3


Demographic Distribution:
----------------------------------------

Gender:
gender
M    197644
F     34514
Name: count, dtype: int64

Admission Type:
admission_type
URGENT          207706
EW EMER.         23516
DIRECT EMER.       936
Name: count, dtype: int64

DRG Type:
drg_type
HCFA    125206
APR     106952
Name: count, dtype: int64


## 2. Medical NER Setup

We'll use several state-of-the-art medical NER models:
1. **scispaCy**: Specialized for scientific and biomedical text
2. **Hugging Face Transformers**: Clinical BERT, BioBERT models
3. **MedCAT**: Medical Concept Annotation Tool

Key entity types to extract:
- **Diseases/Conditions**: Diagnoses, symptoms
- **Medications**: Drug names, dosages
- **Procedures**: Medical tests, treatments
- **Anatomy**: Body parts, organs
- **Lab Values**: Test results, measurements


### 2.1 Install Required Libraries

Run this cell once to install the necessary packages:


In [13]:
# Installation commands. The scispaCy and MedCAT lines are optional; uncomment them to install.
# !pip install scispacy
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz
!uv pip install transformers torch
# !pip install medcat
!uv pip install scikit-learn matplotlib seaborn
!uv pip install datasets

print("To install the optional packages, uncomment and run the commented commands above")


[2K[2mResolved [1m24 packages[0m [2min 217ms[0m[0m                                        [0m
[2K[37m⠙[0m [2mPreparing packages...[0m (0/6)                                                   
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/6)--------------[0m[0m     0 B/131.74 KiB          [1A
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/6)--------------[0m[0m 16.00 KiB/131.74 KiB        [1A
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/6)--------------[0m[0m 16.00 KiB/131.74 KiB        [1A
[2mjinja2              [0m [32m----[2m--------------------------[0m[0m 16.00 KiB/131.74 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/6)--------------[0m[0m     0 B/1.64 MiB            [2A
[2mjinja2              [0m [32m----[2m--------------------------[0m[0m 16.00 KiB/131.74 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/6)--------------[0m[0m 14.88 KiB/1.64 MiB          [2A
[2mjinja2              [0m [32m--------[2m----

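### 2.2 Set Up the NER Model

scispaCy models are trained on biomedical literature and can extract:
- **BC5CDR model**: Chemicals and diseases
- **Core Sci model**: General scientific entities

The scispaCy loading code in the next cell is commented out; the cell instead loads the OpenMed disease-detection model from Hugging Face.
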

In [15]:
# # Load scispaCy model
# try:
#     import spacy
#     import scispacy
#     from scispacy.linking import EntityLinker
    
#     # Load the BC5CDR model (trained on chemicals and diseases)
#     nlp_sci = spacy.load("en_ner_bc5cdr_md")
    
#     print("✓ scispaCy model loaded successfully!")
#     print(f"Pipeline components: {nlp_sci.pipe_names}")
    
# except Exception as e:
#     print(f"Error loading scispaCy: {e}")
#     print("Please install scispacy and download the model first")

# Load the tokenizer and model directly (the pipeline cell below wraps the same checkpoint)
from transformers import AutoTokenizer, AutoModelForTokenClassification

disease_tokenizer = AutoTokenizer.from_pretrained("OpenMed/OpenMed-NER-DiseaseDetect-SuperClinical-184M")
disease_model = AutoModelForTokenClassification.from_pretrained("OpenMed/OpenMed-NER-DiseaseDetect-SuperClinical-184M")


In [34]:
# Use a pipeline as a high-level helper
from transformers import pipeline

disease_ner = pipeline("token-classification", model="OpenMed/OpenMed-NER-DiseaseDetect-SuperClinical-184M")

Device set to use mps:0


In [19]:
sample_texts = [
    'The patient presented with metastatic adenocarcinoma of the lung with mutations in EGFR and KRAS genes. Treatment with erlotinib was initiated, targeting the epidermal growth factor receptor pathway.',
    'Histological examination revealed invasive ductal carcinoma with high-grade nuclear features. The tumor showed positive estrogen receptor and HER2 amplification, indicating potential for targeted therapy.',
    'The oncologist recommended adjuvant chemotherapy with doxorubicin and cyclophosphamide, followed by paclitaxel, to target rapidly dividing cancer cells in the breast tissue.'
    # data[data['comments'].notna()]['comments'].iloc[10],
    # data[data['comments'].notna()]['comments'].iloc[20]
]

In [35]:
output = disease_ner(sample_texts[0])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [36]:
output

[{'entity': 'B-DISEASE',
  'score': np.float32(0.9197412),
  'index': 6,
  'word': '▁adenocarcinoma',
  'start': 37,
  'end': 52},
 {'entity': 'I-DISEASE',
  'score': np.float32(0.9735432),
  'index': 7,
  'word': '▁of',
  'start': 52,
  'end': 55},
 {'entity': 'I-DISEASE',
  'score': np.float32(0.9774091),
  'index': 8,
  'word': '▁the',
  'start': 55,
  'end': 59},
 {'entity': 'I-DISEASE',
  'score': np.float32(0.97171795),
  'index': 9,
  'word': '▁lung',
  'start': 59,
  'end': 64}]
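The output above is token-level: one dictionary per subword, tagged `B-DISEASE` or `I-DISEASE`. A minimal sketch of merging such tokens into whole entity spans using their character offsets follows. The helper name `merge_bio_tokens` is hypothetical, and the sketch assumes each token carries `entity`, `score`, `start`, and `end` keys as shown above. The pipeline's own `aggregation_strategy="simple"` option, used later in this notebook, serves a similar purpose.

```python
def merge_bio_tokens(tokens, text):
    """Merge token-level B-/I- predictions into entity spans."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if not label:  # skip untagged tokens such as "O"
            continue
        last = spans[-1] if spans else None
        continues = (
            prefix == "I"
            and last is not None
            and last["entity_group"] == label
            and tok["start"] <= last["end"] + 1  # adjacent or touching
        )
        if continues:
            last["end"] = tok["end"]
            last["scores"].append(float(tok["score"]))
        else:
            spans.append({
                "entity_group": label,
                "start": tok["start"],
                "end": tok["end"],
                "scores": [float(tok["score"])],
            })
    return [
        {
            "entity_group": s["entity_group"],
            "word": text[s["start"]:s["end"]].strip(),
            "score": sum(s["scores"]) / len(s["scores"]),  # mean token score
            "start": s["start"],
            "end": s["end"],
        }
        for s in spans
    ]

text = ("The patient presented with metastatic adenocarcinoma of the lung "
        "with mutations in EGFR and KRAS genes.")
tokens = [
    {"entity": "B-DISEASE", "score": 0.9197412, "start": 37, "end": 52},
    {"entity": "I-DISEASE", "score": 0.9735432, "start": 52, "end": 55},
    {"entity": "I-DISEASE", "score": 0.9774091, "start": 55, "end": 59},
    {"entity": "I-DISEASE", "score": 0.97171795, "start": 59, "end": 64},
]
merged = merge_bio_tokens(tokens, text)
print(merged[0]["word"])
```

Recovering the surface text from offsets avoids the `▁` markers that appear in the `word` field of the raw output.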

In [39]:
# Test NER on sample clinical text


print("Sample NER Results:")
print("="*80)

for i, text in enumerate(sample_texts[:3], 1):
    if pd.notna(text) and len(str(text)) > 10:
        try:
            
            entities = disease_ner(text)
            print(f"\n{i}. Text: {str(text)[:100]}...")
            print(f"   Entities found:")
            if entities:
                for ent in entities:
                    # Without aggregation_strategy, the pipeline returns one dict per token
                    # with 'word', 'entity', 'score', etc.
                    entity_text = ent['word']
                    entity_type = ent['entity']
                    confidence = ent['score']
                    print(f"   - {entity_text:30s} | {entity_type:15s} | Score: {confidence:.3f}")
            else:
                print("   - No entities detected")
                
        except Exception as e:
            print(f"Error processing text: {e}")
            continue


Sample NER Results:

1. Text: The patient presented with metastatic adenocarcinoma of the lung with mutations in EGFR and KRAS gen...
   Entities found:
   - ▁adenocarcinoma                | B-DISEASE       | Score: 0.920
   - ▁of                            | I-DISEASE       | Score: 0.974
   - ▁the                           | I-DISEASE       | Score: 0.977
   - ▁lung                          | I-DISEASE       | Score: 0.972

2. Text: Histological examination revealed invasive ductal carcinoma with high-grade nuclear features. The tu...
   Entities found:
   - ▁ductal                        | B-DISEASE       | Score: 0.737
   - ▁carcinoma                     | I-DISEASE       | Score: 0.979
   - ▁tumor                         | B-DISEASE       | Score: 0.953

3. Text: The oncologist recommended adjuvant chemotherapy with doxorubicin and cyclophosphamide, followed by ...
   Entities found:
   - ▁cancer                        | B-DISEASE       | Score: 0.969


### 2.3 Apply NER to Dataset

Create a function to extract entities from all text fields and create NER features.


In [None]:
def extract_entities_from_text(text, nlp_model):
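    """
    Extract named entities from text using a spaCy-style NER model.
    
    Args:
        text: Raw text; NaN, empty strings, and the '___' placeholder yield zero counts.
        nlp_model: A loaded spaCy pipeline (e.g. scispaCy's en_ner_bc5cdr_md) whose
            documents expose `.ents`. A Hugging Face pipeline has no `.ents`, so it
            would hit the except branch and silently return zero counts.
    
    Returns:
        dict: Entity counts by type and a list of (text, label) tuples
    """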
    if pd.isna(text) or str(text).strip() == '' or str(text) == '___':
        return {
            'entity_count': 0,
            'disease_count': 0,
            'chemical_count': 0,
            'entities': []
        }
    
    try:
        doc = nlp_model(str(text)[:10000])  # Limit text length for efficiency
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        
        disease_count = sum(1 for _, label in entities if label == 'DISEASE')
        chemical_count = sum(1 for _, label in entities if label == 'CHEMICAL')
        
        return {
            'entity_count': len(entities),
            'disease_count': disease_count,
            'chemical_count': chemical_count,
            'entities': entities
        }
    except Exception:  # any model failure falls back to zero counts
        return {
            'entity_count': 0,
            'disease_count': 0,
            'chemical_count': 0,
            'entities': []
        }

print("NER extraction function defined")
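The function above expects a spaCy-style model, while the model actually loaded in section 2.2 is a Hugging Face pipeline. A minimal sketch of an equivalent extractor for pipeline-style output follows. The name `extract_entities_with_pipeline` is hypothetical, and the sketch assumes the pipeline was created with an aggregation strategy so that each prediction has an `entity_group` key. The `DISEASE` and `CHEMICAL` label names are also assumptions and depend on the model.

```python
import math

def _empty_result():
    """Fresh zero-count result so callers never share the same list."""
    return {"entity_count": 0, "disease_count": 0, "chemical_count": 0, "entities": []}

def extract_entities_with_pipeline(text, ner_pipeline, max_chars=10000):
    """Count entities returned by a callable that yields dicts with 'word' and 'entity_group'."""
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return _empty_result()
    cleaned = str(text).strip()
    if cleaned in ("", "___"):
        return _empty_result()

    predictions = ner_pipeline(cleaned[:max_chars])
    entities = [(p["word"], p["entity_group"].upper()) for p in predictions]
    return {
        "entity_count": len(entities),
        "disease_count": sum(1 for _, label in entities if label == "DISEASE"),
        "chemical_count": sum(1 for _, label in entities if label == "CHEMICAL"),
        "entities": entities,
    }

# Stand-in for a real pipeline so the sketch runs without downloading a model
def fake_pipeline(text):
    return [
        {"word": "cirrhosis", "entity_group": "DISEASE", "score": 0.9},
        {"word": "furosemide", "entity_group": "CHEMICAL", "score": 0.8},
    ]

result = extract_entities_with_pipeline("cirrhosis treated with furosemide", fake_pipeline)
print(result["entity_count"], result["disease_count"], result["chemical_count"])
```

Unlike the spaCy version, this sketch lets model errors propagate instead of converting them to zero counts, which makes a misconfigured model easier to notice.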


In [None]:
# Process a sample of the data (first 1000 records) to demonstrate NER feature extraction
print("Processing sample data with NER...")
print("This may take a few minutes...")

sample_size = 1000
data_sample = data.head(sample_size).copy()

# Combine relevant text fields into a single text for NER processing
data_sample['combined_text'] = (
    data_sample['comments'].fillna('') + ' ' + 
    data_sample['drug'].fillna('') + ' ' +
    data_sample['test_name'].fillna('') + ' ' +
    data_sample['spec_type_desc'].fillna('')
).str.strip()

# Apply NER to combined text.
# NOTE: nlp_sci is the scispaCy model from section 2.2; uncomment that loading code first.
try:
    from tqdm import tqdm
    tqdm.pandas(desc="Extracting entities")
    ner_results = data_sample['combined_text'].progress_apply(
        lambda x: extract_entities_from_text(x, nlp_sci)
    )
except ImportError:
    # tqdm is optional; fall back to a plain apply without a progress bar
    print("Processing without progress bar...")
    ner_results = data_sample['combined_text'].apply(
        lambda x: extract_entities_from_text(x, nlp_sci)
    )

# Extract NER features
data_sample['entity_count'] = ner_results.apply(lambda x: x['entity_count'])
data_sample['disease_count'] = ner_results.apply(lambda x: x['disease_count'])
data_sample['chemical_count'] = ner_results.apply(lambda x: x['chemical_count'])
data_sample['entities'] = ner_results.apply(lambda x: x['entities'])

print(f"\n✓ Processed {sample_size} records")
print(f"Total entities extracted: {data_sample['entity_count'].sum():,}")
print(f"Average entities per record: {data_sample['entity_count'].mean():.2f}")


In [None]:
# Visualize NER results
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
sns.set_palette("husl")

# Entity count distribution
axes[0, 0].hist(data_sample['entity_count'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Number of Entities')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Entity Counts per Record')
axes[0, 0].axvline(data_sample['entity_count'].mean(), color='red', linestyle='--', 
                    label=f"Mean: {data_sample['entity_count'].mean():.2f}")
axes[0, 0].legend()

# Disease vs Chemical counts
axes[0, 1].scatter(data_sample['disease_count'], data_sample['chemical_count'], alpha=0.5)
axes[0, 1].set_xlabel('Disease Entity Count')
axes[0, 1].set_ylabel('Chemical Entity Count')
axes[0, 1].set_title('Disease vs Chemical Entities')

# Entity count by admission type
if 'admission_type' in data_sample.columns:
    admission_entity_stats = data_sample.groupby('admission_type')['entity_count'].mean().sort_values()
    admission_entity_stats.plot(kind='barh', ax=axes[1, 0])
    axes[1, 0].set_xlabel('Average Entity Count')
    axes[1, 0].set_title('Average Entities by Admission Type')

# Top extracted entities
all_entities = []
for entities_list in data_sample['entities']:
    all_entities.extend([ent[0].lower() for ent in entities_list])

if len(all_entities) > 0:
    from collections import Counter
    entity_counts = Counter(all_entities)
    top_entities = pd.DataFrame(entity_counts.most_common(15), 
                                 columns=['Entity', 'Count'])
    top_entities.plot(x='Entity', y='Count', kind='barh', ax=axes[1, 1], legend=False)
    axes[1, 1].set_xlabel('Frequency')
    axes[1, 1].set_title('Top 15 Extracted Entities')
    axes[1, 1].invert_yaxis()

plt.tight_layout()
plt.show()

print("\n✓ NER analysis visualization complete")


## 3. ICD-10 Coding Classification Model

Now we'll build a classification model to predict DRG descriptions (as a proxy for ICD-10 codes) using:
1. **Structured features**: Demographics, admission details
2. **NER-derived features**: Entity counts, entity types
3. **Text embeddings**: From clinical text fields

We'll compare multiple approaches:
- Traditional ML (Random Forest, XGBoost) with engineered features
- Deep learning with clinical text embeddings
- Hybrid models combining structured + NER + text features


### 3.1 Prepare Classification Dataset

We need to aggregate records by admission (`hadm_id`), since each admission spans many rows (about 17,858 on average across the 13 admissions).


In [None]:
# Aggregate data by admission
print("Aggregating data by admission (hadm_id)...")

# For the sample data with NER features
agg_sample = data_sample.groupby('hadm_id').agg({
    # Demographics (take first value)
    'subject_id': 'first',
    'admission_type': 'first',
    'admission_location': 'first',
    'discharge_location': 'first',
    'insurance': 'first',
    'marital_status': 'first',
    'race': 'first',
    'gender': 'first',
    'anchor_age': 'first',
    
    # Target variable
    'description': 'first',  # DRG description
    'drg_type': 'first',
    'drg_severity': 'first',
    'drg_mortality': 'first',
    
    # NER features (aggregate)
    'entity_count': 'sum',
    'disease_count': 'sum',
    'chemical_count': 'sum',
    
    # Count of records per admission
    'hadm_id': 'count'
}).rename(columns={'hadm_id': 'record_count'})

print(f"✓ Aggregated from {len(data_sample)} records to {len(agg_sample)} admissions")
print(f"\nAggregated dataset shape: {agg_sample.shape}")
print(f"Average records per admission: {agg_sample['record_count'].mean():.2f}")

# Display sample
agg_sample.head()


In [None]:
# Check target variable distribution
print("Target Variable Distribution:")
print("="*80)
target_dist = agg_sample['description'].value_counts()
print(f"Total unique diagnoses: {len(target_dist)}")
print(f"\nTop 10 diagnoses in sample:")
for i, (diag, count) in enumerate(target_dist.head(10).items(), 1):
    print(f"{i:2d}. {diag[:55]:55s} - {count:3d} ({count/len(agg_sample)*100:.1f}%)")

# Filter to focus on top N classes for manageable classification
top_n_classes = 10
top_classes = target_dist.head(top_n_classes).index.tolist()
agg_filtered = agg_sample[agg_sample['description'].isin(top_classes)].copy()

print(f"\n✓ Filtered to top {top_n_classes} classes: {len(agg_filtered)} admissions")
print(f"Coverage: {len(agg_filtered)/len(agg_sample)*100:.1f}% of sample")
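With only 13 admissions in the full dataset, some classes may have a single example after aggregation, and a stratified split needs at least two samples per class. A minimal sketch of selecting the rows whose class is frequent enough follows. The helper name `keep_classes_with_min_count` and the toy labels are hypothetical.

```python
from collections import Counter

def keep_classes_with_min_count(labels, min_count=2):
    """Return positions of labels whose class occurs at least min_count times."""
    counts = Counter(labels)
    return [i for i, label in enumerate(labels) if counts[label] >= min_count]

# Toy labels: class "B" has a single example and is filtered out
labels = ["A", "A", "B", "C", "C", "C"]
kept = keep_classes_with_min_count(labels)
print(kept)
```

The returned positions can be used to subset the aggregated rows before calling `train_test_split` with `stratify`.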


### 3.2 Feature Engineering


In [None]:
# Feature Engineering
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Prepare features
feature_data = agg_filtered.copy()

# Encode categorical variables
categorical_cols = ['admission_type', 'admission_location', 'discharge_location', 
                    'insurance', 'marital_status', 'race', 'gender', 'drg_type']

label_encoders = {}
for col in categorical_cols:
    if col in feature_data.columns:
        le = LabelEncoder()
        feature_data[f'{col}_encoded'] = le.fit_transform(feature_data[col].fillna('Unknown'))
        label_encoders[col] = le

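# Define feature sets
# NOTE: drg_severity, drg_mortality, and drg_type come from the same DRG record as the
# target description, so they may leak label information; consider dropping them.
numerical_features = ['anchor_age', 'record_count', 'entity_count', 
                     'disease_count', 'chemical_count', 'drg_severity', 'drg_mortality']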

encoded_features = [f'{col}_encoded' for col in categorical_cols if col in feature_data.columns]

all_features = numerical_features + encoded_features

# Prepare X and y
X = feature_data[all_features].fillna(0)
y = feature_data['description']

print("Feature Engineering Complete:")
print("="*80)
print(f"Total features: {len(all_features)}")
print(f"  - Numerical features: {len(numerical_features)}")
print(f"  - Categorical features (encoded): {len(encoded_features)}")
print(f"\nFeature names:")
for i, feat in enumerate(all_features, 1):
    print(f"  {i:2d}. {feat}")
    
print(f"\nDataset shape: {X.shape}")
print(f"Target classes: {y.nunique()}")


### 3.3 Train Classification Models


In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Data Split:")
print("="*80)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nClass distribution in training set:")
print(y_train.value_counts())
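The split above works at the admission level, so admissions from the same patient can land in both the training and test sets. A minimal sketch of a patient-level split that keeps every `subject_id` on one side follows. The helper name `split_by_group` is hypothetical; scikit-learn's `GroupShuffleSplit` is the library counterpart of this idea.

```python
import random

def split_by_group(groups, test_fraction=0.2, seed=42):
    """Split row positions so that no group appears in both train and test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# One entry per admission; the row counts per patient are made up for illustration
groups = (
    [10000032] * 3 + [10002495] * 2 + [10002930] * 2 + [10003400] * 2
    + [10004235] * 2 + [10004720] + [10004733]
)
train_idx, test_idx = split_by_group(groups)
print(len(train_idx), len(test_idx))
```

With 7 patients and a 20% test fraction, this holds out a single patient, which shows how little data is available for evaluation.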


In [None]:
# Train multiple classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Dictionary to store models and results
models = {}
results = {}

print("Training Classification Models...")
print("="*80)

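# 1. Logistic Regression
print("\n1. Training Logistic Regression...")
# multi_class is omitted: it is deprecated in recent scikit-learn releases, and the
# default lbfgs solver already fits a multinomial model for multiclass targets
lr_model = LogisticRegression(max_iter=1000, random_state=42)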
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)
lr_acc = accuracy_score(y_test, lr_pred)
lr_f1 = f1_score(y_test, lr_pred, average='weighted')

models['Logistic Regression'] = lr_model
results['Logistic Regression'] = {
    'accuracy': lr_acc,
    'f1_score': lr_f1,
    'predictions': lr_pred
}
print(f"   Accuracy: {lr_acc:.4f}, F1-Score: {lr_f1:.4f}")

# 2. Random Forest
print("\n2. Training Random Forest...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)
rf_f1 = f1_score(y_test, rf_pred, average='weighted')

models['Random Forest'] = rf_model
results['Random Forest'] = {
    'accuracy': rf_acc,
    'f1_score': rf_f1,
    'predictions': rf_pred
}
print(f"   Accuracy: {rf_acc:.4f}, F1-Score: {rf_f1:.4f}")

print("\n" + "="*80)
print("✓ Model training complete!")


### 3.4 Model Evaluation and Comparison


In [None]:
# Compare model performance
print("Model Performance Comparison:")
print("="*80)

comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [results[m]['accuracy'] for m in results.keys()],
    'F1-Score (Weighted)': [results[m]['f1_score'] for m in results.keys()]
}).sort_values('Accuracy', ascending=False)

print(comparison_df.to_string(index=False))

# Visualize comparison
fig, ax = plt.subplots(1, 2, figsize=(14, 5))

comparison_df.plot(x='Model', y='Accuracy', kind='bar', ax=ax[0], legend=False, color='steelblue')
ax[0].set_ylabel('Accuracy')
ax[0].set_title('Model Accuracy Comparison')
ax[0].set_ylim([0, 1])
ax[0].tick_params(axis='x', rotation=45)

comparison_df.plot(x='Model', y='F1-Score (Weighted)', kind='bar', ax=ax[1], legend=False, color='coral')
ax[1].set_ylabel('F1-Score (Weighted)')
ax[1].set_title('Model F1-Score Comparison')
ax[1].set_ylim([0, 1])
ax[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()


In [None]:
# Detailed classification report for best model
best_model_name = comparison_df.iloc[0]['Model']
best_predictions = results[best_model_name]['predictions']

print(f"\nDetailed Classification Report - {best_model_name}:")
print("="*80)
print(classification_report(y_test, best_predictions, zero_division=0))

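# Confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Pass labels explicitly so the matrix rows and columns match the tick labels,
# even when a class is missing from the test set
class_labels = sorted(y.unique())
cm = confusion_matrix(y_test, best_predictions, labels=class_labels)
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=class_labels, 
            yticklabels=class_labels)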
plt.title(f'Confusion Matrix - {best_model_name}')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


In [None]:
# Feature importance analysis (for Random Forest)
if 'Random Forest' in models:
    rf_importance = pd.DataFrame({
        'Feature': all_features,
        'Importance': models['Random Forest'].feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print("\nFeature Importance (Random Forest):")
    print("="*80)
    print(rf_importance.to_string(index=False))
    
    # Visualize top 15 features
    plt.figure(figsize=(10, 8))
    top_features = rf_importance.head(15)
    plt.barh(range(len(top_features)), top_features['Importance'])
    plt.yticks(range(len(top_features)), top_features['Feature'])
    plt.xlabel('Importance')
    plt.title('Top 15 Most Important Features (Random Forest)')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
    # Analyze NER feature contribution
    ner_features = ['entity_count', 'disease_count', 'chemical_count']
    ner_importance = rf_importance[rf_importance['Feature'].isin(ner_features)]
    
    print("\n\nNER Feature Contribution:")
    print("-"*40)
    print(ner_importance.to_string(index=False))
    print(f"\nTotal NER importance: {ner_importance['Importance'].sum():.4f}")
    print(f"Percentage of total: {ner_importance['Importance'].sum() / rf_importance['Importance'].sum() * 100:.2f}%")


## 4. Advanced NER with Transformers (Optional)

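For more sophisticated NER, we can use pre-trained clinical transformer models:
- **ClinicalBERT**: Pre-trained on clinical notes
- **BioBERT**: Pre-trained on biomedical literature (PubMed abstracts and PMC articles)
- **Bio_ClinicalBERT**: Initialized from BioBERT and further pre-trained on MIMIC-III clinical notes

These are base encoders, so each needs a token-classification head fine-tuned on labeled data before it can perform NER.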

These models can:
1. Extract more nuanced entities
2. Generate contextual embeddings for classification
3. Handle medical terminology better than general NER models


### 4.1 Setup Hugging Face Transformers for Medical NER


In [None]:
# Load a transformer-based biomedical NER model (from Hugging Face)
try:
    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
    
    # Use a medical NER model
    model_name = "d4data/biomedical-ner-all"  # Pre-trained biomedical NER model
    
    print("Loading Hugging Face medical NER model...")
    print(f"Model: {model_name}")
    
    # Create NER pipeline
    ner_pipeline = pipeline(
        "ner", 
        model=model_name, 
        tokenizer=model_name,
        aggregation_strategy="simple"  # Group subword tokens
    )
    
    print("✓ Transformer-based NER model loaded successfully!")
    
except Exception as e:
    print(f"Error loading transformer model: {e}")
    print("Install with: pip install transformers torch")
    ner_pipeline = None


In [None]:
# Test transformer NER on sample text
if ner_pipeline is not None:
    sample_text = data[data['comments'].notna()]['comments'].iloc[0]
    
    print("Sample Text:")
    print("="*80)
    print(sample_text[:300] + "...")
    
    print("\n\nTransformer NER Results:")
    print("-"*80)
    
    try:
        # Run NER
        entities = ner_pipeline(str(sample_text)[:512])  # Slice to 512 characters (not tokens) to keep the input short
        
        if entities:
            for ent in entities:
                print(f"Entity: {ent['word']:30s} | Type: {ent['entity_group']:15s} | Score: {ent['score']:.3f}")
        else:
            print("No entities detected")
            
    except Exception as e:
        print(f"Error: {e}")
else:
    print("Transformer NER pipeline not available")
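
To use the transformer output alongside the scispaCy counts, the pipeline's entity list can be collapsed into per-type counts. The sketch below assumes the output shape produced with `aggregation_strategy="simple"` (a list of dicts with `entity_group`, `word`, and `score` keys). The function name `count_entities_by_type`, the `min_score` cutoff of 0.5, and the example entities are illustrative assumptions, not part of the notebook or real model output.

```python
from collections import Counter

def count_entities_by_type(entities, min_score=0.5):
    """Collapse NER pipeline output into {entity_group: count}.

    `entities` is assumed to be a list of dicts with 'entity_group' and
    'score' keys. `min_score` is a hypothetical confidence cutoff.
    """
    counts = Counter()
    for ent in entities:
        if ent["score"] >= min_score:
            counts[ent["entity_group"]] += 1
    return dict(counts)

# Hand-written entities in the pipeline's output shape (labels are made up)
example_entities = [
    {"entity_group": "Disease_disorder", "word": "cirrhosis", "score": 0.97},
    {"entity_group": "Medication", "word": "furosemide", "score": 0.91},
    {"entity_group": "Disease_disorder", "word": "ascites", "score": 0.42},
]
print(count_entities_by_type(example_entities))
```

The resulting dict can be turned into columns in the same way as `entity_count`, `disease_count`, and `chemical_count`. The cutoff would need tuning on real data.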


## 5. Summary and Next Steps

### What We've Accomplished

1. **Data Exploration**
   - Analyzed 232,158 clinical transcript records
   - Identified key text fields for NER
   - Understood target variable distribution (DRG descriptions)

2. **Medical NER Implementation**
   - Used scispaCy for biomedical entity extraction
   - Extracted disease and chemical entities
   - Created NER-based features (entity counts by type)
   - Optional: Transformer-based NER for enhanced extraction

3. **ICD-10 Classification Model**
   - Built baseline models (Logistic Regression, Random Forest)
   - Combined structured features + NER features
   - Evaluated model performance
   - Analyzed feature importance

### Key Findings

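Based on the 1K-record sample used for modeling:

- NER features contribute to diagnosis prediction (see the "NER Feature Contribution" output above for their share of Random Forest importance)
- Combining structured data with NER features improves classification
- Entity extraction helps capture clinical patterns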

### Next Steps to Improve the Model

1. **Scale to Full Dataset**
   - Process all 232K records (we used a 1K sample)
   - May take several hours for NER extraction
   
2. **Enhanced Text Features**
   - Add text embeddings from ClinicalBERT/BioBERT
   - Use TF-IDF on clinical text
   - Extract more specific entity types (procedures, anatomy, lab values)

3. **Advanced Models**
   - XGBoost with hyperparameter tuning
   - Neural networks with text + structured features
   - Multi-task learning (predict severity + mortality + diagnosis)

4. **Map to Actual ICD-10 Codes**
   - DRG descriptions are only a proxy for ICD-10 diagnoses
   - Obtain actual ICD-10 code mappings if available
   - Build multi-label classifier (patients often have multiple diagnoses)

5. **Entity Linking**
   - Link extracted entities to medical ontologies (UMLS, SNOMED CT)
   - Use standardized concept IDs as features

6. **Clinical Context**
   - Extract entity relationships (e.g., drug treats disease)
   - Temporal information (admission timeline)
   - Negation detection (no symptoms vs. symptoms present)
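
Negation detection (item 6) matters for the `comments` field in particular, where a phrase such as "No VRE isolated." mentions an organism that is absent. A minimal NegEx-style sketch follows. It assumes a small hand-picked cue list and a fixed look-back window, both of which are illustrative and not a validated clinical lexicon. The name `is_negated` is hypothetical.

```python
import re

# Illustrative negation cues; a real system would use a curated lexicon
NEGATION_CUES = ("no", "not", "without", "negative for", "denies")

def is_negated(text, entity, window=5):
    """Return True if a negation cue appears within `window` words before `entity`.

    Simplified heuristic: only the first mention is checked, and the search
    stops at the nearest preceding sentence boundary.
    """
    lowered = text.lower()
    pos = lowered.find(entity.lower())
    if pos == -1:
        return False
    # Keep only the part of the current sentence that precedes the entity
    preceding = re.split(r"[.;!?]", lowered[:pos])[-1]
    words = re.findall(r"[a-z']+", preceding)
    context = " " + " ".join(words[-window:]) + " "
    return any(f" {cue} " in context for cue in NEGATION_CUES)

print(is_negated("No VRE isolated.", "VRE"))
print(is_negated("Patient has ascites. No fever.", "ascites"))
```

Because the sentence split treats every period as a boundary, abbreviations such as "C. difficile" will cut the context short. A production system would need proper sentence segmentation.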


### 5.1 Code Template for Full Dataset Processing

Below is a template for processing the entire dataset (warning: may take hours):


In [None]:
# Template for processing full dataset (commented out to avoid long runtime)
"""
# Process full dataset with NER
print("Processing full dataset...")
print("This will take several hours. Consider using batch processing or parallel processing.")

# Combine text fields
data['combined_text'] = (
    data['comments'].fillna('') + ' ' + 
    data['drug'].fillna('') + ' ' +
    data['test_name'].fillna('') + ' ' +
    data['spec_type_desc'].fillna('')
).str.strip()

# Option 1: Process in batches to save memory
batch_size = 10000
num_batches = (len(data) + batch_size - 1) // batch_size  # ceiling division; avoids an empty final batch

all_ner_results = []

for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(data))
    
    print(f"Processing batch {i+1}/{num_batches} (records {start_idx} to {end_idx})...")
    
    batch = data.iloc[start_idx:end_idx]
    batch_results = batch['combined_text'].apply(
        lambda x: extract_entities_from_text(x, nlp_sci)
    )
    all_ner_results.extend(batch_results)
    
    # Save intermediate results
    if i % 10 == 0:
        print(f"Saving checkpoint at batch {i}...")
        # Save to file

# Extract features from NER results
data['entity_count'] = [r['entity_count'] for r in all_ner_results]
data['disease_count'] = [r['disease_count'] for r in all_ner_results]
data['chemical_count'] = [r['chemical_count'] for r in all_ner_results]

# Aggregate by admission
full_agg_data = data.groupby('hadm_id').agg({
    # ... same aggregation as before
})

# Train on full dataset
# ... modeling code

print("Full dataset processing complete!")
"""

print("Template code for full dataset processing (currently commented out)")
print("Uncomment and run when ready to process all 232K records")
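
The template leaves checkpointing as a `# Save to file` placeholder. One way to fill it in, sketched here with only the standard library, is to write the accumulated results to a JSON file every few batches so that a crash partway through does not lose all the work. The function name `process_in_batches`, the `extract_fn` argument (standing in for `extract_entities_from_text`), and the checkpoint file layout are assumptions for illustration. The sketch also assumes the extractor returns JSON-serializable values.

```python
import json
import math
from pathlib import Path

def process_in_batches(texts, extract_fn, batch_size=10000,
                       checkpoint_path=None, checkpoint_every=10):
    """Apply `extract_fn` to each text in batches, optionally checkpointing.

    `extract_fn` must return JSON-serializable results (e.g. a dict of counts).
    """
    results = []
    # Ceiling division avoids an empty trailing batch
    num_batches = math.ceil(len(texts) / batch_size)
    for i in range(num_batches):
        batch = texts[i * batch_size:(i + 1) * batch_size]
        results.extend(extract_fn(text) for text in batch)
        is_last = i == num_batches - 1
        # Overwrite the checkpoint every `checkpoint_every` batches and at the end
        if checkpoint_path and ((i + 1) % checkpoint_every == 0 or is_last):
            Path(checkpoint_path).write_text(json.dumps(results))
    return results

# Toy extractor standing in for the real NER function
def toy_extract(text):
    return {"entity_count": len(text.split())}

demo = process_in_batches(["a b", "c", "d e f", "g"], toy_extract, batch_size=2)
print(demo)
```

Rewriting the whole results list at each checkpoint is simple but grows slower as results accumulate. Writing one file per batch is an alternative if that becomes a bottleneck.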


## 6. Additional Resources and References

### Medical NER Models

**scispaCy Models**:
- `en_core_sci_sm`: Small general scientific model
- `en_core_sci_md`: Medium general scientific model
- `en_ner_bc5cdr_md`: Chemical and disease NER
- `en_ner_jnlpba_md`: Protein, DNA, RNA, cell types
- `en_ner_bionlp13cg_md`: Cancer genetics entities

**Hugging Face Models**:
- `emilyalsentzer/Bio_ClinicalBERT`: Clinical notes + biomedical literature
- `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`: Pre-trained on PubMed abstracts and full-text articles
- `d4data/biomedical-ner-all`: Multi-entity biomedical NER
- `allenai/scibert_scivocab_uncased`: Scientific papers

### Useful Libraries

1. **MedCAT** (Medical Concept Annotation Tool)
   - Comprehensive medical NER and linking
   - Links to UMLS, SNOMED CT
   - `pip install medcat`

2. **ClinicalBERT**
   - Pre-trained on MIMIC-III notes
   - Better for clinical documentation

3. **Bio-LM**
   - Pre-trained language models for biomedical and clinical text
   - RoBERTa-style models trained on PubMed, PMC, and MIMIC-III

### Entity Linking & Ontologies

- **UMLS** (Unified Medical Language System): Comprehensive medical thesaurus
- **SNOMED CT**: Clinical healthcare terminology
- **RxNorm**: Normalized drug names
- **LOINC**: Lab test codes
- **ICD-10**: International Classification of Diseases, 10th Revision

### Relevant Papers

1. **Clinical BERT**: "Publicly Available Clinical BERT Embeddings" (Alsentzer et al., 2019)
2. **BioBERT**: "BioBERT: a pre-trained biomedical language representation model for biomedical text mining" (Lee et al., 2020)
3. **scispaCy**: "ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing" (Neumann et al., 2019)
4. **ICD Coding**: "Explainable Prediction of Medical Codes from Clinical Text" (Mullenbach et al., 2018)

### Datasets

- **MIMIC-III/IV**: De-identified hospital and ICU clinical data (requires credentialing)
- **i2b2**: De-identified clinical notes with annotations
- **n2c2**: Shared-task datasets for clinical NLP (successor to the i2b2 challenges)
- **PubMed**: Biomedical literature abstracts

### Code Examples & Tutorials

- scispaCy documentation: https://allenai.github.io/scispacy/
- Hugging Face clinical models: https://huggingface.co/models?search=clinical
- MIMIC Code Repository: https://github.com/MIT-LCP/mimic-code
