# Introduction

Sepsis is a life-threatening organ dysfunction caused by a dysregulated host response to infection. It occurs when the body’s immune response to an infection becomes uncontrolled, leading to widespread inflammation, tissue damage, and potential organ failure.

The Sepsis-3 definition (2016) by the Third International Consensus Definitions for Sepsis and Septic Shock states:

	•	Sepsis: A life-threatening organ dysfunction caused by a dysregulated host response to infection. Organ dysfunction is identified by an increase of 2 or more points in the Sequential Organ Failure Assessment (SOFA) score.
	•	Septic Shock: A subset of sepsis characterized by circulatory and cellular/metabolic dysfunction associated with a higher risk of mortality. It is clinically identified by:
	•	Persistent hypotension requiring vasopressors to maintain a mean arterial pressure (MAP) ≥ 65 mmHg.
	•	Serum lactate > 2 mmol/L despite adequate fluid resuscitation.

Early recognition and prompt treatment with antibiotics, fluid resuscitation, and organ support are crucial to improving outcomes.  Early predictors of sepsis involve a combination of clinical, laboratory, and physiological markers that indicate an escalating inflammatory response and organ dysfunction. Key early indicators include:

1. Clinical Signs and Symptoms

	•	Fever or Hypothermia (Temperature >38.3°C or <36°C)

	•	Tachycardia (HR >90 bpm in adults)

	•	Tachypnea or Respiratory Distress (RR >22/min)

	•	Altered Mental Status (Confusion, disorientation, or lethargy)

	•	Hypotension (Systolic BP <100 mmHg)

	•	Decreased Urine Output (Oliguria <0.5 mL/kg/h)

2. Laboratory Biomarkers

	•	Elevated White Blood Cell Count (WBC) (>12,000/mm³ or <4,000/mm³)

	•	Elevated Procalcitonin (PCT) (>0.5 ng/mL; >2 ng/mL is highly suggestive of sepsis)

	•	Increased C-Reactive Protein (CRP) (>100 mg/L)

	•	Elevated Lactate (>2 mmol/L suggests tissue hypoxia; >4 mmol/L is severe)

	•	Coagulation Abnormalities (INR >1.5, aPTT >60s, or thrombocytopenia <100,000/mm³)

3. Scoring Systems for Early Detection

	•	qSOFA (Quick SOFA) Score (≥2 suggests a higher risk of sepsis)

	•	RR ≥22/min

	•	Altered mental status (GCS <15)

	•	Systolic BP ≤100 mmHg

	•	SOFA Score (Sequential Organ Failure Assessment; increase by ≥2 points indicates sepsis)

	•	NEWS (National Early Warning Score) (combines vital signs to detect deterioration)

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
!gcloud storage buckets list --project strong-eon-442117-q0 --format="value(name)"

mimic3-dataset


In [None]:
!gcloud storage ls gs://mimic3-dataset/

gs://mimic3-dataset/MIMIC-III/


In [None]:
!gcloud storage ls gs://mimic3-dataset/MIMIC-III/

gs://mimic3-dataset/MIMIC-III/.DS_Store
gs://mimic3-dataset/MIMIC-III/ADMISSIONS.csv
gs://mimic3-dataset/MIMIC-III/CALLOUT.csv
gs://mimic3-dataset/MIMIC-III/CAREGIVERS.csv
gs://mimic3-dataset/MIMIC-III/CHARTEVENTS.csv
gs://mimic3-dataset/MIMIC-III/CPTEVENTS.csv
gs://mimic3-dataset/MIMIC-III/DATETIMEEVENTS.csv
gs://mimic3-dataset/MIMIC-III/DIAGNOSES_ICD.csv
gs://mimic3-dataset/MIMIC-III/DRGCODES.csv
gs://mimic3-dataset/MIMIC-III/D_CPT.csv
gs://mimic3-dataset/MIMIC-III/D_ICD_DIAGNOSES.csv
gs://mimic3-dataset/MIMIC-III/D_ICD_PROCEDURES.csv
gs://mimic3-dataset/MIMIC-III/D_ITEMS.csv
gs://mimic3-dataset/MIMIC-III/D_LABITEMS.csv
gs://mimic3-dataset/MIMIC-III/ICUSTAYS.csv
gs://mimic3-dataset/MIMIC-III/INPUTEVENTS_CV.csv
gs://mimic3-dataset/MIMIC-III/INPUTEVENTS_MV.csv
gs://mimic3-dataset/MIMIC-III/LABEVENTS.csv
gs://mimic3-dataset/MIMIC-III/LICENSE.txt
gs://mimic3-dataset/MIMIC-III/MICROBIOLOGYEVENTS.csv
gs://mimic3-dataset/MIMIC-III/NOTEEVENTS.csv
gs://mimic3-dataset/MIMIC-III/OUTPUTEVENTS.cs

In [None]:
!pip install dask  # Install Dask

import dask.dataframe as dd
import pandas as pd
import numpy as np



**Data Preprocessing (MIMIC-III)**
We need to extract relevant time-series data for patients from MIMIC-III:

- Vital Signs (HR, BP, RR, Temp, O2 Sat, Urine Output)
- Glascow score (check for deteriration)
- Urine output (check for oliguria)
- Laboratory Values (WBC, PCT, CRP, Lactate, INR, Platelet Count)
- Scoring Systems (qSOFA, SOFA, NEWS)

We'll process CHARTEVENTS, LABEVENTS, ICUSTAYS, and ADMISSIONS tables along with NOTES.

# Extracting & Processing Vital Signs (Heart Rate, BP, Resp Rate, Temp, SpO2, O2 Flow).

In [None]:
import dask.dataframe as dd
import pandas as pd

# Google Cloud Storage Path
bucket_path = "gs://mimic3-dataset/MIMIC-III/"

# Correct vital sign ITEMIDs
vital_signs = {
    220045: "HeartRate",
    220179: "SystolicBP",
    220180: "DiastolicBP",
    220210: "RespiratoryRate",
    220277: "SpO2",
    223761: "Temperature",
    223834: "O2Flow",
    223835: "InspiredO2Fraction"
}

# Load CHARTEVENTS in chunks using Dask
df_vitals = dd.read_csv(
    f"{bucket_path}CHARTEVENTS.csv",
    usecols=["ICUSTAY_ID", "CHARTTIME", "ITEMID", "VALUE"],
    dtype={"ICUSTAY_ID": "Int32", "CHARTTIME": "str", "ITEMID": "Int32", "VALUE": "object"},
    assume_missing=True,
    blocksize="50MB"
)

# Convert CHARTTIME to datetime
df_vitals["CHARTTIME"] = dd.to_datetime(df_vitals["CHARTTIME"], errors="coerce")

# Filter only relevant ITEMIDs
df_vitals = df_vitals[df_vitals["ITEMID"].isin(vital_signs.keys())]

# Map ITEMID to human-readable labels
df_vitals["ITEMID"] = df_vitals["ITEMID"].map(vital_signs, meta=("ITEMID", "object"))

# Convert VALUE column to numeric safely
df_vitals["VALUE"] = df_vitals["VALUE"].apply(pd.to_numeric, errors="coerce", meta=("VALUE", "float64"))

# Process data in 500K-row batches
batch_size = 500_000  # Process 500K rows per batch
batch_num = 1

# Instead of counting, just process in smaller chunks
df_vitals = df_vitals.repartition(npartitions=24)  # ~500K rows per partition

for partition in df_vitals.to_delayed():
    print(f"Processing Batch {batch_num}...")

    # Compute only 500K rows (one partition at a time)
    df_batch = partition.compute()

    # Drop NaNs in VALUE column
    df_batch = df_batch.dropna(subset=["VALUE"])

    # Aggregate by mean (combine duplicate time entries)
    df_batch = df_batch.groupby(["ICUSTAY_ID", "CHARTTIME", "ITEMID"], as_index=False)["VALUE"].mean()

    # Pivot table for time-series format
    df_batch = df_batch.pivot(index=["ICUSTAY_ID", "CHARTTIME"], columns="ITEMID", values="VALUE").reset_index()

    # Fill missing values with 0
    df_batch.fillna(0, inplace=True)

    # Save batch to CSV
    batch_filename = f"/content/vital_signs_batch_{batch_num}.csv"
    df_batch.to_csv(batch_filename, index=False)

    print(f"Saved Batch {batch_num} - {batch_filename}")
    batch_num += 1

print("All batches processed successfully!")


Processing Batch 1...
Saved Batch 1 - /content/vital_signs_batch_1.csv
Processing Batch 2...
Saved Batch 2 - /content/vital_signs_batch_2.csv
Processing Batch 3...
Saved Batch 3 - /content/vital_signs_batch_3.csv
Processing Batch 4...
Saved Batch 4 - /content/vital_signs_batch_4.csv
Processing Batch 5...
Saved Batch 5 - /content/vital_signs_batch_5.csv
Processing Batch 6...
Saved Batch 6 - /content/vital_signs_batch_6.csv
Processing Batch 7...
Saved Batch 7 - /content/vital_signs_batch_7.csv
Processing Batch 8...
Saved Batch 8 - /content/vital_signs_batch_8.csv
Processing Batch 9...
Saved Batch 9 - /content/vital_signs_batch_9.csv
Processing Batch 10...
Saved Batch 10 - /content/vital_signs_batch_10.csv
Processing Batch 11...
Saved Batch 11 - /content/vital_signs_batch_11.csv
Processing Batch 12...
Saved Batch 12 - /content/vital_signs_batch_12.csv
Processing Batch 13...
Saved Batch 13 - /content/vital_signs_batch_13.csv
Processing Batch 14...
Saved Batch 14 - /content/vital_signs_batc

# Extract & Process Glasgow Coma Scale (GCS) for Altered Mental Status

In [None]:
import dask.dataframe as dd
import pandas as pd

# Google Cloud Storage Path
bucket_path = "gs://mimic3-dataset/MIMIC-III/"

# Glasgow Coma Scale (GCS) ITEMIDs
gcs_itemids = {
    198: "GCS_Total",
    454: "GCS_Verbal",
    184: "GCS_Motor",
    723: "GCS_Eye",
    223900: "GCS_Total",
    220739: "GCS_Verbal",
    223901: "GCS_Motor",
    220745: "GCS_Eye"
}

# Load CHARTEVENTS in true chunks using Dask
df_gcs = dd.read_csv(
    f"{bucket_path}CHARTEVENTS.csv",
    usecols=["ICUSTAY_ID", "CHARTTIME", "ITEMID", "VALUE"],
    dtype={"ICUSTAY_ID": "Int32", "CHARTTIME": "str", "ITEMID": "Int32", "VALUE": "object"},
    assume_missing=True,
    blocksize="50MB"
)

# Convert CHARTTIME to datetime
df_gcs["CHARTTIME"] = dd.to_datetime(df_gcs["CHARTTIME"], errors="coerce")

# Filter only relevant ITEMIDs (GCS Scores)
df_gcs = df_gcs[df_gcs["ITEMID"].isin(gcs_itemids.keys())]

# Map ITEMID to human-readable labels
df_gcs["ITEMID"] = df_gcs["ITEMID"].map(gcs_itemids, meta=("ITEMID", "object"))

# Convert VALUE column to numeric safely
df_gcs["VALUE"] = df_gcs["VALUE"].apply(pd.to_numeric, errors="coerce", meta=("VALUE", "float64"))

# Process GCS in smaller batches (500K rows per batch)
batch_size = 500_000
batch_num = 1

df_gcs = df_gcs.repartition(npartitions=24)  # ~500K rows per partition

for partition in df_gcs.to_delayed():
    print(f"Processing GCS Batch {batch_num}...")

    # Compute only 500K rows
    df_batch = partition.compute()

    # Aggregate GCS scores to avoid duplicate entries before pivoting
    df_batch = df_batch.groupby(["ICUSTAY_ID", "CHARTTIME", "ITEMID"], as_index=False)["VALUE"].mean()

    # Pivot to wide format (one row per ICUSTAY_ID, CHARTTIME)
    df_batch = df_batch.pivot(index=["ICUSTAY_ID", "CHARTTIME"], columns="ITEMID", values="VALUE").reset_index()

    # Ensure all GCS columns exist (fill missing ones with NaN)
    for col in ["GCS_Total", "GCS_Verbal", "GCS_Motor", "GCS_Eye"]:
        if col not in df_batch.columns:
            df_batch[col] = float("nan")

    # Compute minimum GCS
    df_batch["GCS_Min"] = df_batch[["GCS_Total", "GCS_Verbal", "GCS_Motor", "GCS_Eye"]].min(axis=1)

    # Flag Altered Mental Status (AMS) if GCS ≤ 12
    df_batch["Altered_Mental_Status"] = (df_batch["GCS_Min"] <= 12).astype(int)

    # Keep only necessary columns
    df_batch = df_batch[["ICUSTAY_ID", "CHARTTIME", "Altered_Mental_Status"]]

    # Save batch to CSV
    batch_filename = f"/content/gcs_batch_{batch_num}.csv"
    df_batch.to_csv(batch_filename, index=False)

    print(f"Saved GCS Batch {batch_num} - {batch_filename}")
    batch_num += 1

print("All GCS batches processed successfully!")


Processing GCS Batch 1...
Saved GCS Batch 1 - /content/gcs_batch_1.csv
Processing GCS Batch 2...
Saved GCS Batch 2 - /content/gcs_batch_2.csv
Processing GCS Batch 3...
Saved GCS Batch 3 - /content/gcs_batch_3.csv
Processing GCS Batch 4...
Saved GCS Batch 4 - /content/gcs_batch_4.csv
Processing GCS Batch 5...
Saved GCS Batch 5 - /content/gcs_batch_5.csv
Processing GCS Batch 6...
Saved GCS Batch 6 - /content/gcs_batch_6.csv
Processing GCS Batch 7...
Saved GCS Batch 7 - /content/gcs_batch_7.csv
Processing GCS Batch 8...
Saved GCS Batch 8 - /content/gcs_batch_8.csv
Processing GCS Batch 9...
Saved GCS Batch 9 - /content/gcs_batch_9.csv
Processing GCS Batch 10...
Saved GCS Batch 10 - /content/gcs_batch_10.csv
Processing GCS Batch 11...
Saved GCS Batch 11 - /content/gcs_batch_11.csv
Processing GCS Batch 12...
Saved GCS Batch 12 - /content/gcs_batch_12.csv
Processing GCS Batch 13...
Saved GCS Batch 13 - /content/gcs_batch_13.csv
Processing GCS Batch 14...
Saved GCS Batch 14 - /content/gcs_batc

# Extracting & Processing urine output (Oliguria)

In [None]:
import dask.dataframe as dd
import pandas as pd

# Google Cloud Storage Path
bucket_path = "gs://mimic3-dataset/MIMIC-III/"

# Weight ITEMID (Needs to be verified in D_ITEMS.csv)
weight_itemid = {226512}  # Placeholder, verify in D_ITEMS.csv

# Load CHARTEVENTS in chunks using Dask
df_weight = dd.read_csv(
    f"{bucket_path}CHARTEVENTS.csv",
    usecols=["ICUSTAY_ID", "CHARTTIME", "ITEMID", "VALUE"],
    dtype={"ICUSTAY_ID": "Int32", "CHARTTIME": "str", "ITEMID": "Int32", "VALUE": "object"},
    assume_missing=True,
    blocksize="50MB"  # Load in small chunks
)

# Convert CHARTTIME to datetime
df_weight["CHARTTIME"] = dd.to_datetime(df_weight["CHARTTIME"], errors="coerce")

# Filter only weight ITEMID
df_weight = df_weight[df_weight["ITEMID"].isin(weight_itemid)]

# Convert VALUE column to numeric safely
df_weight["VALUE"] = df_weight["VALUE"].apply(pd.to_numeric, errors="coerce", meta=("VALUE", "float64"))

# Rename column
df_weight = df_weight.rename(columns={"VALUE": "WEIGHT"})

# Keep only necessary columns
df_weight = df_weight[["ICUSTAY_ID", "WEIGHT"]]

# Process in Chunks and Save
batch_size = 500_000  # Process 1M rows at a time
batch_num = 1
df_weight = df_weight.repartition(npartitions=24)  # Split into 24 chunks

for partition in df_weight.to_delayed():
    print(f" Processing Weight Batch {batch_num}...")

    # Compute only 1M rows
    df_batch = partition.compute()

    # Remove NaN values in weight
    df_batch = df_batch.dropna()

    # Save batch to CSV
    batch_filename = f"/content/weight_batch_{batch_num}.csv"
    df_batch.to_csv(batch_filename, index=False)

    print(f"Saved Weight Batch {batch_num} - {batch_filename}")
    batch_num += 1

print("All Weight Batches Processed Successfully!")


🚀 Processing Weight Batch 1...
Saved Weight Batch 1 - /content/weight_batch_1.csv
🚀 Processing Weight Batch 2...
Saved Weight Batch 2 - /content/weight_batch_2.csv
🚀 Processing Weight Batch 3...
Saved Weight Batch 3 - /content/weight_batch_3.csv
🚀 Processing Weight Batch 4...
Saved Weight Batch 4 - /content/weight_batch_4.csv
🚀 Processing Weight Batch 5...
Saved Weight Batch 5 - /content/weight_batch_5.csv
🚀 Processing Weight Batch 6...
Saved Weight Batch 6 - /content/weight_batch_6.csv
🚀 Processing Weight Batch 7...
Saved Weight Batch 7 - /content/weight_batch_7.csv
🚀 Processing Weight Batch 8...
Saved Weight Batch 8 - /content/weight_batch_8.csv
🚀 Processing Weight Batch 9...
Saved Weight Batch 9 - /content/weight_batch_9.csv
🚀 Processing Weight Batch 10...
Saved Weight Batch 10 - /content/weight_batch_10.csv
🚀 Processing Weight Batch 11...
Saved Weight Batch 11 - /content/weight_batch_11.csv
🚀 Processing Weight Batch 12...
Saved Weight Batch 12 - /content/weight_batch_12.csv
🚀 Proce

In [None]:
# Convert ICUSTAY_ID to integer (fix potential merge issues)
df_weight["ICUSTAY_ID"] = pd.to_numeric(df_weight["ICUSTAY_ID"], errors="coerce").astype("Int64")

# Drop any rows where ICUSTAY_ID is NaN (very rare cases)
df_weight = df_weight.dropna(subset=["ICUSTAY_ID"])

# Print final info after fixing
print("Weight Data Fixed!")
print(df_weight.info())

Weight Data Fixed!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22599 entries, 0 to 22598
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ICUSTAY_ID  22599 non-null  Int64  
 1   WEIGHT      22599 non-null  float64
dtypes: Int64(1), float64(1)
memory usage: 375.3 KB
None


In [None]:
df_weight["ICUSTAY_ID"] = pd.to_numeric(df_weight["ICUSTAY_ID"], errors="coerce").astype("Int64")

In [None]:
import pandas as pd
import glob

# Step 1: Load Preprocessed Weight Data
print("Merging all Weight Batches...")

# Find all weight batch files
weight_files = sorted(glob.glob("/content/weight_batch_*.csv"))

# Merge all weight batches into a single DataFrame
df_weight = pd.concat((pd.read_csv(f) for f in weight_files), ignore_index=True)

print("All Weight Batches Merged!")
print(df_weight.info())

# Step 2: Process Urine Output Using Merged Weight Data
import dask.dataframe as dd

# Google Cloud Storage Path
bucket_path = "gs://mimic3-dataset/MIMIC-III/"

# Urine Output ITEMIDs
urine_itemids = {
    40055, 40056, 40057, 40061, 40065, 43175, 43176, 43177,
    43348, 43355, 226559
}

# Load OUTPUTEVENTS in chunks using Dask
df_urine = dd.read_csv(
    f"{bucket_path}OUTPUTEVENTS.csv",
    usecols=["ICUSTAY_ID", "CHARTTIME", "ITEMID", "VALUE"],
    dtype={"ICUSTAY_ID": "Int32", "CHARTTIME": "str", "ITEMID": "Int32", "VALUE": "object"},
    assume_missing=True,
    blocksize="50MB"
)

# Convert CHARTTIME to datetime
df_urine["CHARTTIME"] = dd.to_datetime(df_urine["CHARTTIME"], errors="coerce")

# Filter only relevant ITEMIDs (Urine Output)
df_urine = df_urine[df_urine["ITEMID"].isin(urine_itemids)]

# Convert VALUE column to numeric safely
df_urine["VALUE"] = df_urine["VALUE"].apply(pd.to_numeric, errors="coerce", meta=("VALUE", "float64"))

# Process urine output batch by batch
batch_size = 500_000
batch_num = 1
df_urine = df_urine.repartition(npartitions=24)  # Split into ~500K-row chunks

for partition in df_urine.to_delayed():
    print(f"Processing Urine Output Batch {batch_num}...")

    # Compute only 500K rows
    df_batch = partition.compute()

    # Aggregate duplicate urine outputs per `ICUSTAY_ID`, `CHARTTIME`
    df_batch = df_batch.groupby(["ICUSTAY_ID", "CHARTTIME"], as_index=False)["VALUE"].sum()

    # Merge Urine Output with Patient Weight (Now from merged CSV)
    df_batch = df_batch.merge(df_weight, on="ICUSTAY_ID", how="left")

    # Assign fillna result properly (instead of inplace=True)
    df_batch["WEIGHT"] = df_batch["WEIGHT"].fillna(70)  # Assume 70kg for missing values

    # Calculate Urine Output per kg per hour
    df_batch["Urine_per_kg_per_hour"] = df_batch["VALUE"] / df_batch["WEIGHT"]

    # Flag Oliguria if Urine Output < 0.5 mL/kg/h for >6 hours
    df_batch["Oliguria"] = (df_batch["Urine_per_kg_per_hour"] < 0.5).astype(int)

    # Keep only necessary columns
    df_batch = df_batch[["ICUSTAY_ID", "CHARTTIME", "Oliguria"]]

    # Save batch to CSV
    batch_filename = f"/content/urine_output_batch_{batch_num}.csv"
    df_batch.to_csv(batch_filename, index=False)

    print(f"Saved Urine Output Batch {batch_num} - {batch_filename}")
    batch_num += 1

print("All Urine Output Batches Processed Successfully!")

Merging all Weight Batches...
All Weight Batches Merged!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22599 entries, 0 to 22598
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ICUSTAY_ID  22599 non-null  object 
 1   WEIGHT      22599 non-null  float64
dtypes: float64(1), object(1)
memory usage: 353.2+ KB
None


  df_weight = pd.concat((pd.read_csv(f) for f in weight_files), ignore_index=True)


Processing Urine Output Batch 1...
Saved Urine Output Batch 1 - /content/urine_output_batch_1.csv
Processing Urine Output Batch 2...
Saved Urine Output Batch 2 - /content/urine_output_batch_2.csv
Processing Urine Output Batch 3...
Saved Urine Output Batch 3 - /content/urine_output_batch_3.csv
Processing Urine Output Batch 4...
Saved Urine Output Batch 4 - /content/urine_output_batch_4.csv
Processing Urine Output Batch 5...
Saved Urine Output Batch 5 - /content/urine_output_batch_5.csv
Processing Urine Output Batch 6...
Saved Urine Output Batch 6 - /content/urine_output_batch_6.csv
Processing Urine Output Batch 7...
Saved Urine Output Batch 7 - /content/urine_output_batch_7.csv
Processing Urine Output Batch 8...
Saved Urine Output Batch 8 - /content/urine_output_batch_8.csv
Processing Urine Output Batch 9...
Saved Urine Output Batch 9 - /content/urine_output_batch_9.csv
Processing Urine Output Batch 10...
Saved Urine Output Batch 10 - /content/urine_output_batch_10.csv
Processing Urine 

# Extracting & Processing sepsis-related laboratory biomarkers

In [None]:
import dask.dataframe as dd
import pandas as pd

# Google Cloud Storage Path
bucket_path = "gs://mimic3-dataset/MIMIC-III/"

# Relevant Lab ITEMIDs
lab_itemids = {
    51300: "WBC",          # White Blood Cell Count
    50862: "CRP",          # C-Reactive Protein
    50960: "Lactate",      # Lactate
    51237: "INR",          # International Normalized Ratio
    51275: "aPTT",         # Activated Partial Thromboplastin Time
    51265: "Platelets",    # Platelet Count
    50810: "PCT",          # Procalcitonin
    50813: "ScvO2"         # Central Venous Oxygen Saturation
}

# Load LABEVENTS in chunks using Dask (Fix: Use SUBJECT_ID, HADM_ID)
df_labs = dd.read_csv(
    f"{bucket_path}LABEVENTS.csv",
    usecols=["SUBJECT_ID", "HADM_ID", "CHARTTIME", "ITEMID", "VALUE"],
    dtype={"SUBJECT_ID": "Int32", "HADM_ID": "Int32", "CHARTTIME": "str", "ITEMID": "Int32", "VALUE": "object"},
    assume_missing=True,
    blocksize="50MB"
)

# Convert CHARTTIME to datetime
df_labs["CHARTTIME"] = dd.to_datetime(df_labs["CHARTTIME"], errors="coerce")

# Filter only relevant ITEMIDs (Lab results)
df_labs = df_labs[df_labs["ITEMID"].isin(lab_itemids.keys())]

# Map ITEMID to human-readable labels
df_labs["ITEMID"] = df_labs["ITEMID"].map(lab_itemids, meta=("ITEMID", "object"))

# Convert VALUE column to numeric safely
df_labs["VALUE"] = df_labs["VALUE"].apply(pd.to_numeric, errors="coerce", meta=("VALUE", "float64"))

# Load ICU stay mapping to get ICUSTAY_ID
df_icustays = dd.read_csv(
    f"{bucket_path}ICUSTAYS.csv",
    usecols=["SUBJECT_ID", "HADM_ID", "ICUSTAY_ID"],
    dtype={"SUBJECT_ID": "Int32", "HADM_ID": "Int32", "ICUSTAY_ID": "Int32"},
    assume_missing=True
)

# Merge LABEVENTS with ICUSTAYS to get `ICUSTAY_ID`
df_labs = df_labs.merge(df_icustays, on=["SUBJECT_ID", "HADM_ID"], how="left")

# Process Labs in 500K-row chunks
batch_size = 500_000
batch_num = 1

df_labs = df_labs.repartition(npartitions=24)  # ~500K rows per partition

for partition in df_labs.to_delayed():
    print(f"Processing Lab Batch {batch_num}...")

    # Compute only 500K rows
    df_batch = partition.compute()

    # Drop rows without an `ICUSTAY_ID` (focus on ICU patients)
    df_batch = df_batch.dropna(subset=["ICUSTAY_ID"])

    # Aggregate duplicate lab values by mean
    df_batch = df_batch.groupby(["ICUSTAY_ID", "CHARTTIME", "ITEMID"], as_index=False)["VALUE"].mean()

    # Pivot table for time-series format
    df_batch = df_batch.pivot(index=["ICUSTAY_ID", "CHARTTIME"], columns="ITEMID", values="VALUE").reset_index()

    # Save batch to CSV
    batch_filename = f"/content/lab_results_batch_{batch_num}.csv"
    df_batch.to_csv(batch_filename, index=False)

    print(f"Saved Lab Batch {batch_num} - {batch_filename}")
    batch_num += 1

print("All Lab Batches Processed Successfully!")

Processing Lab Batch 1...
Saved Lab Batch 1 - /content/lab_results_batch_1.csv
Processing Lab Batch 2...
Saved Lab Batch 2 - /content/lab_results_batch_2.csv
Processing Lab Batch 3...
Saved Lab Batch 3 - /content/lab_results_batch_3.csv
Processing Lab Batch 4...
Saved Lab Batch 4 - /content/lab_results_batch_4.csv
Processing Lab Batch 5...
Saved Lab Batch 5 - /content/lab_results_batch_5.csv
Processing Lab Batch 6...
Saved Lab Batch 6 - /content/lab_results_batch_6.csv
Processing Lab Batch 7...
Saved Lab Batch 7 - /content/lab_results_batch_7.csv
Processing Lab Batch 8...
Saved Lab Batch 8 - /content/lab_results_batch_8.csv
Processing Lab Batch 9...
Saved Lab Batch 9 - /content/lab_results_batch_9.csv
Processing Lab Batch 10...
Saved Lab Batch 10 - /content/lab_results_batch_10.csv
Processing Lab Batch 11...
Saved Lab Batch 11 - /content/lab_results_batch_11.csv
Processing Lab Batch 12...
Saved Lab Batch 12 - /content/lab_results_batch_12.csv
Processing Lab Batch 13...
Saved Lab Batch 

# Extracting Sepsis related notes in clinical NOTES

Model Choice: BioClinicalBERT
Since general BERT models struggle with medical terminology and BioClinicalBERT understands sepsis, AMS, and ICU-related language.
Instead of using a general BERT model, we will use BioClinicalBERT, a BERT model fine-tuned on clinical notes.




In [None]:
!pip install transformers datasets torch

Collecting datasets
  Downloading datasets-3.4.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.

In [None]:
import torch

if torch.cuda.is_available():
    print("GPU is available!")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
else:
    print("GPU is NOT available! Running on CPU.")

GPU is NOT available! Running on CPU.


In [None]:
import torch

# Force CPU mode
device = torch.device("cpu")
print(f"Running on {device}")

# Load Transformer Model on CPU
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # Example: Change this based on your task
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Move model to CPU
model.to(device)

# Example input
text = "Sepsis is a severe response to infection."
inputs = tokenizer(text, return_tensors="pt").to(device)

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

print("Model successfully ran on CPU!")

Running on cpu


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Model successfully ran on CPU!


We are extracting free-text clinical notes from the NOTEEVENTS.csv file in MIMIC-III. This file contains unstructured textual data written by medical professionals.

Relevant Text Categories for Sepsis Detection: We filtered specific categories of notes that might contain sepsis-related information:

Category	Purpose in Sepsis Detection
- Nursing	Captures real-time patient monitoring, vital signs, fluid balance, and early deterioration signs.
- Physician	Provides clinical reasoning, suspected infections, antibiotic decisions, and differential diagnoses.
Progress	Tracks changes in patient condition, response to treatment, and worsening infections.
- Discharge Summary	Summarizes the entire hospital stay, including final diagnosis and interventions taken.

In these notes, we are checking for sepsis-related phrases and patterns, such as:

1. Clinical Signs of Sepsis
"Fever of 38.5°C", "Tachycardia (HR 120 bpm)", "Hypotensive, BP 85/50", "Altered mental status, confused", "Urine output < 30 mL/hr (oliguria)"
2. Infection & Suspected Sepsis: "Suspected sepsis", "Septic shock", "Bacteremia", "Severe pneumonia", "Gram-negative bacteremia", "Infection source unclear, empiric antibiotics started"
3. Lab and Biomarkers Indicating Sepsis: "Elevated lactate > 2.5 mmol/L", "Elevated WBC count > 15,000", "High CRP 120 mg/L", "INR 1.7, aPTT 65s, suspecting DIC", "Procalcitonin 3.5 ng/mL, highly suspicious of bacterial sepsis"
4. Hemodynamic & Organ Dysfunction: "Started norepinephrine due to persistent hypotension", "MAP below 65 despite fluid resuscitation", "Started mechanical ventilation, worsening ARDS", "AKI developing, creatinine rising"

In [None]:
import dask.dataframe as dd

# Google Cloud Storage Path
bucket_path = "gs://mimic3-dataset/MIMIC-III/"

# Load only the first few rows to check available columns
df_check = dd.read_csv(f"{bucket_path}NOTEEVENTS.csv", blocksize="10MB")
print(df_check.columns)

Index(['ROW_ID', 'SUBJECT_ID', 'HADM_ID', 'CHARTDATE', 'CHARTTIME',
       'STORETIME', 'CATEGORY', 'DESCRIPTION', 'CGID', 'ISERROR', 'TEXT'],
      dtype='object')


In [None]:
import pandas as pd
import torch
import gc  # Garbage collection to free memory
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Force CPU mode (to avoid GPU issues)
device = torch.device("cpu")
print(f"Running on {device}")

# Load Pretrained Clinical BERT Model
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Binary classification
model.to(device)  # Move model to CPU

print("Bio_ClinicalBERT is now running on CPU!")

# Google Cloud Storage Path
bucket_path = "gs://mimic3-dataset/MIMIC-III/NOTEEVENTS.csv"

# Step 1: Verify Available Column Names (Load a small sample)
print("🔍 Checking column names in NOTEEVENTS.csv...")
df_test = pd.read_csv(bucket_path, nrows=5, encoding="utf-8", low_memory=False)
df_test.columns = df_test.columns.str.strip()  #  Remove extra spaces in column names
available_columns = df_test.columns.tolist()
print(f"✔ Found columns: {available_columns}")

# Step 2: Select correct columns
time_col = "CHARTTIME" if "CHARTTIME" in available_columns else "CHARTDATE"
expected_columns = ["HADM_ID", time_col, "CATEGORY", "TEXT"]
selected_columns = [col for col in expected_columns if col in available_columns]

if not selected_columns:
    raise ValueError("No matching columns found in the dataset!")

print(f"✔ Using columns: {selected_columns}")

# Step 3: Process Data in Smaller Chunks
print("Processing NOTEEVENTS in small chunks to avoid memory crash...")
chunk_size = 25_000  # Reduce batch size (prevents RAM overuse)
batch_num = 1
processed_files = []

# Read in chunks and process on the fly
for chunk in pd.read_csv(bucket_path, usecols=selected_columns, chunksize=chunk_size, encoding="utf-8", low_memory=False):
    print(f"Processing Notes Batch {batch_num}...")

    # Drop rows with missing text
    chunk = chunk.dropna(subset=["TEXT"])

    # Keep only relevant note categories
    relevant_categories = ["Nursing", "Physician", "Progress", "Discharge summary"]
    chunk = chunk[chunk["CATEGORY"].isin(relevant_categories)]

    # Convert CHARTTIME to datetime
    chunk[time_col] = pd.to_datetime(chunk[time_col], errors="coerce")  # Convert safely
    chunk[time_col] = chunk[time_col].astype("datetime64[ns]")  # Explicitly set dtype

    # Skip tokenization for now (just clean and save data)
    batch_filename = f"/content/clean_notes_batch_{batch_num}.parquet"
    chunk.to_parquet(batch_filename, index=False)
    processed_files.append(batch_filename)

    print(f"Saved Clean Notes Batch {batch_num} - {batch_filename}")

    # Free memory manually
    del chunk
    gc.collect()  # Force garbage collection to free RAM

    batch_num += 1

print("All Clean Notes Batches Processed Successfully!")
print(f"Processed files: {processed_files}")


Running on cpu


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Bio_ClinicalBERT is now running on CPU!
🔍 Checking column names in NOTEEVENTS.csv...
✔ Found columns: ['ROW_ID', 'SUBJECT_ID', 'HADM_ID', 'CHARTDATE', 'CHARTTIME', 'STORETIME', 'CATEGORY', 'DESCRIPTION', 'CGID', 'ISERROR', 'TEXT']
✔ Using columns: ['HADM_ID', 'CHARTTIME', 'CATEGORY', 'TEXT']
Processing NOTEEVENTS in small chunks to avoid memory crash...
Processing Notes Batch 1...
Saved Clean Notes Batch 1 - /content/clean_notes_batch_1.parquet
Processing Notes Batch 2...
Saved Clean Notes Batch 2 - /content/clean_notes_batch_2.parquet
Processing Notes Batch 3...
Saved Clean Notes Batch 3 - /content/clean_notes_batch_3.parquet
Processing Notes Batch 4...
Saved Clean Notes Batch 4 - /content/clean_notes_batch_4.parquet
Processing Notes Batch 5...
Saved Clean Notes Batch 5 - /content/clean_notes_batch_5.parquet
Processing Notes Batch 6...
Saved Clean Notes Batch 6 - /content/clean_notes_batch_6.parquet
Processing Notes Batch 7...
Saved Clean Notes Batch 7 - /content/clean_notes_batch_7.p

In [None]:
import pandas as pd
import glob

# Get all saved Parquet files
parquet_files = sorted(glob.glob("/content/clean_notes_batch_*.parquet"))

# Load and concatenate all files into a single DataFrame
df_notes_final = pd.concat([pd.read_parquet(f) for f in parquet_files], ignore_index=True)

# Save final merged file
df_notes_final.to_parquet("/content/clean_notes_final.parquet", index=False)

print("Merged all notes into clean_notes_final.parquet")

Merged all notes into clean_notes_final.parquet


File clean_notes_final.parquet saved in Google Cloud Storage with the rest of the files. Saved here for ease of access

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
!gcloud storage buckets list --project strong-eon-442117-q0 --format="value(name)"

mimic3-dataset


In [None]:
!gcloud storage ls gs://mimic3-dataset/MIMIC-III/

gs://mimic3-dataset/MIMIC-III/.DS_Store
gs://mimic3-dataset/MIMIC-III/ADMISSIONS.csv
gs://mimic3-dataset/MIMIC-III/CALLOUT.csv
gs://mimic3-dataset/MIMIC-III/CAREGIVERS.csv
gs://mimic3-dataset/MIMIC-III/CHARTEVENTS.csv
gs://mimic3-dataset/MIMIC-III/CPTEVENTS.csv
gs://mimic3-dataset/MIMIC-III/DATETIMEEVENTS.csv
gs://mimic3-dataset/MIMIC-III/DIAGNOSES_ICD.csv
gs://mimic3-dataset/MIMIC-III/DRGCODES.csv
gs://mimic3-dataset/MIMIC-III/D_CPT.csv
gs://mimic3-dataset/MIMIC-III/D_ICD_DIAGNOSES.csv
gs://mimic3-dataset/MIMIC-III/D_ICD_PROCEDURES.csv
gs://mimic3-dataset/MIMIC-III/D_ITEMS.csv
gs://mimic3-dataset/MIMIC-III/D_LABITEMS.csv
gs://mimic3-dataset/MIMIC-III/ICUSTAYS.csv
gs://mimic3-dataset/MIMIC-III/INPUTEVENTS_CV.csv
gs://mimic3-dataset/MIMIC-III/INPUTEVENTS_MV.csv
gs://mimic3-dataset/MIMIC-III/LABEVENTS.csv
gs://mimic3-dataset/MIMIC-III/LICENSE.txt
gs://mimic3-dataset/MIMIC-III/MICROBIOLOGYEVENTS.csv
gs://mimic3-dataset/MIMIC-III/NOTEEVENTS.csv
gs://mimic3-dataset/MIMIC-III/OUTPUTEVENTS.cs

In [None]:
import pandas as pd

# Load the clinical notes
notes_df = pd.read_parquet('gs://mimic3-dataset/MIMIC-III/clean_notes_final.parquet')

# Display basic info
print(notes_df.info())
print(notes_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 283208 entries, 0 to 283207
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   HADM_ID    280410 non-null  float64       
 1   CHARTTIME  222172 non-null  datetime64[ns]
 2   CATEGORY   283208 non-null  object        
 3   TEXT       283208 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 8.6+ MB
None
    HADM_ID CHARTTIME           CATEGORY  \
0  167853.0       NaT  Discharge summary   
1  107527.0       NaT  Discharge summary   
2  167118.0       NaT  Discharge summary   
3  196489.0       NaT  Discharge summary   
4  135453.0       NaT  Discharge summary   

                                                TEXT  
0  Admission Date:  [**2151-7-16**]       Dischar...  
1  Admission Date:  [**2118-6-2**]       Discharg...  
2  Admission Date:  [**2119-5-4**]              D...  
3  Admission Date:  [**2124-7-21**]      

In [None]:
import pandas as pd

# Load clinical notes dataset
notes_df = pd.read_parquet("gs://mimic3-dataset/MIMIC-III/clean_notes_final.parquet")

# Load sepsis diagnosis data
diagnoses_df = pd.read_csv("gs://mimic3-dataset/MIMIC-III/DIAGNOSES_ICD.csv")

# Display basic info
print("Clinical Notes Dataset:")
print(notes_df.info())

print("\nSepsis Diagnoses Dataset:")
print(diagnoses_df.info())

# Check column names
print("\nClinical Notes Columns:", notes_df.columns.tolist())
print("\nDiagnosis Columns:", diagnoses_df.columns.tolist())

Clinical Notes Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 283208 entries, 0 to 283207
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   HADM_ID    280410 non-null  float64       
 1   CHARTTIME  222172 non-null  datetime64[ns]
 2   CATEGORY   283208 non-null  object        
 3   TEXT       283208 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 8.6+ MB
None

Sepsis Diagnoses Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651047 entries, 0 to 651046
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   ROW_ID      651047 non-null  int64  
 1   SUBJECT_ID  651047 non-null  int64  
 2   HADM_ID     651047 non-null  int64  
 3   SEQ_NUM     651000 non-null  float64
 4   ICD9_CODE   651000 non-null  object 
dtypes: float64(1), int64(3), object(1)
memory usage: 24.8+ MB
None

Cl

In [None]:
# Define sepsis-related ICD-9/10 codes (expand this list if needed)
sepsis_codes = [
    "038", "038.0", "038.1", "038.10", "038.11", "038.12", "038.19",  # Bacteremia-related
    "995.91", "995.92", "785.52",  # Severe sepsis, septic shock
    "A41", "A41.0", "A41.1", "A41.2", "A41.3", "A41.4", "A41.5", "A41.8", "A41.9"  # ICD-10 Sepsis codes
]

# Convert ICD9 codes to strings for matching
diagnoses_df["ICD9_CODE"] = diagnoses_df["ICD9_CODE"].astype(str)

# Identify sepsis cases
diagnoses_df["sepsis"] = diagnoses_df["ICD9_CODE"].str.startswith(tuple(sepsis_codes))

# Keep only sepsis-positive admissions
sepsis_df = diagnoses_df[diagnoses_df["sepsis"] == True][["HADM_ID", "sepsis"]]

# Remove duplicates (some patients may have multiple sepsis diagnoses)
sepsis_df = sepsis_df.drop_duplicates(subset="HADM_ID")

# Show the number of sepsis cases
print(f"Total Sepsis Cases Identified: {sepsis_df['HADM_ID'].nunique()}")

Total Sepsis Cases Identified: 6265


In [None]:
# Convert HADM_ID to integers for proper merging
notes_df["HADM_ID"] = notes_df["HADM_ID"].fillna(0).astype("Int64")
sepsis_df["HADM_ID"] = sepsis_df["HADM_ID"].astype("Int64")

# Merge clinical notes with sepsis labels
merged_df = notes_df.merge(sepsis_df, on="HADM_ID", how="left")

# Fill NaN values (patients without sepsis) with 0
merged_df["sepsis"] = merged_df["sepsis"].fillna(False).astype(int)

# Check final counts
print(merged_df["sepsis"].value_counts())
print(merged_df.head())

sepsis
0    218925
1     64283
Name: count, dtype: int64
   HADM_ID CHARTTIME           CATEGORY  \
0   167853       NaT  Discharge summary   
1   107527       NaT  Discharge summary   
2   167118       NaT  Discharge summary   
3   196489       NaT  Discharge summary   
4   135453       NaT  Discharge summary   

                                                TEXT  sepsis  
0  Admission Date:  [**2151-7-16**]       Dischar...       0  
1  Admission Date:  [**2118-6-2**]       Discharg...       0  
2  Admission Date:  [**2119-5-4**]              D...       0  
3  Admission Date:  [**2124-7-21**]              ...       0  
4  Admission Date:  [**2162-3-3**]              D...       0  


  merged_df["sepsis"] = merged_df["sepsis"].fillna(False).astype(int)


In [None]:
# Save the final dataset as Parquet
merged_df.to_parquet("/content/sepsis_notes.parquet", index=False)

print("Successfully saved `sepsis_notes.parquet` with labeled sepsis cases!")

Successfully saved `sepsis_notes.parquet` with labeled sepsis cases!


# **Preprocess & Tokenize Text for BioClinicalBERT**

Load Data & Install Necessary Libraries: First, we need to load the dataset and install Hugging Face's Transformers

In [None]:
import pandas as pd
import re
import torch
from transformers import AutoTokenizer

# Load the saved dataset
df = pd.read_parquet("/content/sepsis_notes.parquet")

# Install transformers if not installed
!pip install transformers -q

Preprocessing Clinical Notes: Transformers require clean, normalized text, so we should:
1. Convert text to lowercase
2. Remove special characters & extra spaces
3. Remove dates, numeric tokens, and non-relevant symbols

In [None]:
# Load BioClinicalBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Function to preprocess text
def preprocess_text(text):
    text = str(text).lower()  # Convert to lowercase
    text = re.sub(r"\[.*?\]", "", text)  # Remove bracketed text (e.g., anonymized dates)
    text = re.sub(r"\d+", "", text)  # Remove numbers
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    return text

# Apply preprocessing
df["clean_text"] = df["TEXT"].apply(preprocess_text)

# Display sample cleaned text
print(df[["TEXT", "clean_text"]].head())

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

                                                TEXT  \
0  Admission Date:  [**2151-7-16**]       Dischar...   
1  Admission Date:  [**2118-6-2**]       Discharg...   
2  Admission Date:  [**2119-5-4**]              D...   
3  Admission Date:  [**2124-7-21**]              ...   
4  Admission Date:  [**2162-3-3**]              D...   

                                          clean_text  
0  admission date: discharge date: service: adden...  
1  admission date: discharge date: date of birth:...  
2  admission date: discharge date: service: cardi...  
3  admission date: discharge date: service: medic...  
4  admission date: discharge date: date of birth:...  


Tokenize Text for BioClinicalBERT: we convert the text into numerical tokens using BioClinicalBERT
Due to insufficient RAMs, the approach will be Instead of applying .apply() to the entire dataset at once, we process it in small chunks using Dask; and then to tokenize in mini-batches

In [None]:
!pip install dask -q
import dask.dataframe as dd

# Load dataset using Dask for memory efficiency
df = dd.read_parquet("/content/sepsis_notes.parquet")

# Convert Dask dataframe to Pandas in small chunks
df_pandas = df.compute()  # If crashing, reduce sample size (e.g., df.sample(frac=0.5))

In [None]:
# Load dataset
import pandas as pd
df_pandas = pd.read_parquet("/content/sepsis_notes.parquet")

# Print available columns
print("Available columns:", df_pandas.columns.tolist())

Available columns: ['HADM_ID', 'CHARTTIME', 'CATEGORY', 'TEXT', 'sepsis']


In [None]:
import pandas as pd
import re

# Load dataset
df = pd.read_parquet("/content/sepsis_notes.parquet")

# Function to clean text
def preprocess_text(text):
    text = str(text).lower()  # Convert to lowercase
    text = re.sub(r"\[.*?\]", "", text)  # Remove anonymized dates
    text = re.sub(r"\d+", "", text)  # Remove numbers
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    return text

# Apply cleaning
df["clean_text"] = df["TEXT"].apply(preprocess_text)

# Save updated dataframe
df.to_parquet("/content/sepsis_notes_cleaned.parquet", index=False)

print("Successfully added `clean_text` and saved to `sepsis_notes_cleaned.parquet`!")

Successfully added `clean_text` and saved to `sepsis_notes_cleaned.parquet`!


In [None]:
# Reload dataset
df_check = pd.read_parquet("/content/sepsis_notes_cleaned.parquet")

# Print columns
print("Available columns:", df_check.columns.tolist())

# Show sample
print(df_check[["TEXT", "clean_text"]].head())

Available columns: ['HADM_ID', 'CHARTTIME', 'CATEGORY', 'TEXT', 'sepsis', 'clean_text']
                                                TEXT  \
0  Admission Date:  [**2151-7-16**]       Dischar...   
1  Admission Date:  [**2118-6-2**]       Discharg...   
2  Admission Date:  [**2119-5-4**]              D...   
3  Admission Date:  [**2124-7-21**]              ...   
4  Admission Date:  [**2162-3-3**]              D...   

                                          clean_text  
0  admission date: discharge date: service: adden...  
1  admission date: discharge date: date of birth:...  
2  admission date: discharge date: service: cardi...  
3  admission date: discharge date: service: medic...  
4  admission date: discharge date: date of birth:...  


In [None]:
import psutil

# Get total and available memory
total_mem = psutil.virtual_memory().total / (1024**3)  # Convert to GB
available_mem = psutil.virtual_memory().available / (1024**3)  # Convert to GB

print(f"Total RAM: {total_mem:.2f} GB")
print(f"Available RAM: {available_mem:.2f} GB")

Total RAM: 12.67 GB
Available RAM: 10.85 GB


In [None]:
import torch
import pandas as pd
import psutil
import gc
import os
from transformers import AutoTokenizer
import pyarrow.parquet as pq

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

import torch
import pandas as pd
import psutil
import gc
import os
import pyarrow as pa
import pyarrow.parquet as pq
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Load dataset
df_path = "/content/sepsis_notes_cleaned.parquet"
df = pd.read_parquet(df_path, columns=["HADM_ID", "clean_text", "sepsis"])

# Check available RAM
available_mem = psutil.virtual_memory().available / (1024**3)
print(f"Available RAM: {available_mem:.2f} GB")

# Set batch size dynamically (reduce if crashing)
batch_size = 1000
print(f"Using batch size: {batch_size}")

# Define output path
output_path = "/content/sepsis_notes_tokenized.parquet"

# Function for batch tokenization
def batch_tokenization(texts):
    return tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )

# Initialize Parquet Writer
parquet_writer = None

# Process text in batches & append to Parquet manually
for i in range(0, len(df), batch_size):
    print(f"Processing batch {i} - {min(i+batch_size, len(df))}...")

    batch_texts = df["clean_text"].iloc[i:i+batch_size].tolist()

    # Tokenize batch
    tokens = batch_tokenization(batch_texts)

    # Convert to dictionary format
    batch_output = pd.DataFrame({
        "HADM_ID": df["HADM_ID"].iloc[i:i+batch_size].values,
        "input_ids": [tokens["input_ids"][j].tolist() for j in range(len(batch_texts))],
        "attention_mask": [tokens["attention_mask"][j].tolist() for j in range(len(batch_texts))],
        "sepsis": df["sepsis"].iloc[i:i+batch_size].values
    })

    # Convert Pandas DataFrame to PyArrow Table
    table = pa.Table.from_pandas(batch_output)

    # If first batch, create Parquet file, otherwise append
    if parquet_writer is None:
        parquet_writer = pq.ParquetWriter(output_path, table.schema, compression="snappy")

    parquet_writer.write_table(table)

    # Free memory manually after each batch
    del tokens, batch_texts, batch_output, table
    torch.cuda.empty_cache()  # Clear GPU memory
    gc.collect()  # Clear CPU memory

# Close Parquet writer after all batches are processed
if parquet_writer:
    parquet_writer.close()

print("Successfully tokenized dataset and saved it in batches!")

Available RAM: 7.37 GB
Using batch size: 1000
Processing batch 0 - 1000...
Processing batch 1000 - 2000...
Processing batch 2000 - 3000...
Processing batch 3000 - 4000...
Processing batch 4000 - 5000...
Processing batch 5000 - 6000...
Processing batch 6000 - 7000...
Processing batch 7000 - 8000...
Processing batch 8000 - 9000...
Processing batch 9000 - 10000...
Processing batch 10000 - 11000...
Processing batch 11000 - 12000...
Processing batch 12000 - 13000...
Processing batch 13000 - 14000...
Processing batch 14000 - 15000...
Processing batch 15000 - 16000...
Processing batch 16000 - 17000...
Processing batch 17000 - 18000...
Processing batch 18000 - 19000...
Processing batch 19000 - 20000...
Processing batch 20000 - 21000...
Processing batch 21000 - 22000...
Processing batch 22000 - 23000...
Processing batch 23000 - 24000...
Processing batch 24000 - 25000...
Processing batch 25000 - 26000...
Processing batch 26000 - 27000...
Processing batch 27000 - 28000...
Processing batch 28000 -

In [None]:
import pandas as pd

# Load tokenized dataset
df_tokenized = pd.read_parquet("/content/sepsis_notes_tokenized.parquet")

# Print basic info
print(df_tokenized.info())
print(df_tokenized.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 283208 entries, 0 to 283207
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   HADM_ID         283208 non-null  Int64 
 1   input_ids       283208 non-null  object
 2   attention_mask  283208 non-null  object
 3   sepsis          283208 non-null  int64 
dtypes: Int64(1), int64(1), object(2)
memory usage: 8.9+ MB
None
   HADM_ID                                          input_ids  \
0   167853  [101, 10296, 2236, 131, 12398, 2236, 131, 1555...   
1   107527  [101, 10296, 2236, 131, 12398, 2236, 131, 2236...   
2   167118  [101, 10296, 2236, 131, 12398, 2236, 131, 1555...   
3   196489  [101, 10296, 2236, 131, 12398, 2236, 131, 1555...   
4   135453  [101, 10296, 2236, 131, 12398, 2236, 131, 2236...   

                                      attention_mask  sepsis  
0  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...       0  
1  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [None]:
!gsutil cp /content/sepsis_notes_tokenized.parquet gs://mimic3-dataset/MIMIC-III/

print("Tokenized dataset successfully uploaded to GCS!")

Copying file:///content/sepsis_notes_tokenized.parquet [Content-Type=application/octet-stream]...
/ [0 files][    0.0 B/166.9 MiB]                                                ==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

\
Operation completed over 1 objects/166.9 MiB.                                    
Tokenized dataset successfully uploaded to GCS!


# **Training BioClinicalBERT**

Load tokenized data from Google Cloud Storage (GCS)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

# Define the GCS path
#gcs_path = "https://drive.google.com/file/d/1gVzuNiiQs-r6vDQvj_OS7luZy-lVIFvF/view?usp=drive_link"

# Load tokenized dataset
df = pd.read_parquet('/content/drive/MyDrive/MIMIC-III_sepsis_notes_tokenized.parquet')

# Show dataset structure
print(df.info())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 283208 entries, 0 to 283207
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   HADM_ID         283208 non-null  Int64 
 1   input_ids       283208 non-null  object
 2   attention_mask  283208 non-null  object
 3   sepsis          283208 non-null  int64 
dtypes: Int64(1), int64(1), object(2)
memory usage: 8.9+ MB
None
   HADM_ID                                          input_ids  \
0   167853  [101, 10296, 2236, 131, 12398, 2236, 131, 1555...   
1   107527  [101, 10296, 2236, 131, 12398, 2236, 131, 2236...   
2   167118  [101, 10296, 2236, 131, 12398, 2236, 131, 1555...   
3   196489  [101, 10296, 2236, 131, 12398, 2236, 131, 1555...   
4   135453  [101, 10296, 2236, 131, 12398, 2236, 131, 2236...   

                                      attention_mask  sepsis  
0  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...       0  
1  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Convert to Tensor Dataset: We now create PyTorch tensors that will be used for training. Convert Data for PyTorch

Since input_ids and attention_mask are stored as lists in object columns, we need to convert them to tensors.

In [3]:
import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader

class SepsisDataset(Dataset):
    def __init__(self, dataframe):
        # Convert lists of tokens to PyTorch tensors efficiently
        self.input_ids = torch.tensor(np.array(dataframe["input_ids"].tolist()), dtype=torch.long)
        self.attention_mask = torch.tensor(np.array(dataframe["attention_mask"].tolist()), dtype=torch.long)
        self.labels = torch.tensor(dataframe["sepsis"].values, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "labels": self.labels[idx]
        }

# Create dataset
dataset = SepsisDataset(df)

# Split into train & validation sets (80/20 split)
train_size = int(0.8 * len(dataset))
train_set, val_set = torch.utils.data.random_split(dataset, [train_size, len(dataset) - train_size])

# Create DataLoaders
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16, shuffle=False)

print(f"Train samples: {len(train_set)}, Validation samples: {len(val_set)}")

Train samples: 226566, Validation samples: 56642


Now we load BioClinicalBERT & define our model:

In [4]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load model & tokenizer
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Set Up Training: we define
- Loss function: CrossEntropyLoss
- Optimizer: AdamW
- Evaluation metric: Accuracy, Precision, Recall, F1-score

In [5]:
import torch.optim as optim
from transformers import TrainingArguments, Trainer

# Set up training parameters (Disable W&B)
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,  # Reduce if crashing
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir="./logs",
    save_total_limit=2,
    fp16=True,  # Mixed precision for efficiency
    report_to="none"  # Disables Weights & Biases (W&B)
)


# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=val_set
)

print("Training setup optimized & ready to start!")



Training setup optimized & ready to start!


# **Training the Model**

In [6]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.5433,0.542276
2,0.5281,0.532399
3,0.5508,0.53228


TrainOutput(global_step=84963, training_loss=0.5397352676119518, metrics={'train_runtime': 4617.2114, 'train_samples_per_second': 147.21, 'train_steps_per_second': 18.401, 'total_flos': 1.7883605810608128e+17, 'train_loss': 0.5397352676119518, 'epoch': 3.0})

# **Evaluating Model Performance**

In [7]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Get predictions
predictions = trainer.predict(val_set)
pred_labels = predictions.predictions.argmax(axis=1)

# Extract true labels
true_labels = [example["labels"].item() for example in val_set]

# Compute accuracy, precision, recall, F1-score
accuracy = accuracy_score(true_labels, pred_labels)
precision, recall, f1, _ = precision_recall_fscore_support(true_labels, pred_labels, average="binary")

print(f" Model Performance:")
print(f" Accuracy: {accuracy:.4f}")
print(f" Precision: {precision:.4f}")
print(f" Recall: {recall:.4f}")
print(f" F1-score: {f1:.4f}")

 Model Performance:
 Accuracy: 0.7757
 Precision: 0.0000
 Recall: 0.0000
 F1-score: 0.0000


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
