# ICU Length of Stay Prediction - MIMIC-III Pipeline

## 🎯 Objective
Predict ICU length of stay (LOS) using PySpark ML on the MIMIC-III dataset

## 📊 Data & Constraints
- **Sources**: 6 MIMIC-III tables (CHARTEVENTS, LABEVENTS, ICUSTAYS, etc.)
- **Filters**:
    - Patient age 18-80
    - LOS 0-9.1 days (IQR-based outlier cutoff)
    - Valid time sequences (ICU INTIME before OUTTIME)
- **Timeframe**: Vitals (first 24h), Labs (6h pre to 24h post ICU)


## 🌀 Big Data Processing

- **Storage**: MIMIC-III is stored in Google Cloud Storage buckets and processed on Google Cloud Dataproc
- **CHARTEVENTS**: The CHARTEVENTS table has over 330 million rows
- **Parquet**: Converted the CHARTEVENTS and LABEVENTS tables to Parquet for efficient storage and processing
- **Filtering**: CHARTEVENTS is filtered immediately on load to keep the DataFrame small

## 🔧 Features (39 total)
- **Demographics (2)**: Age, gender
- **Admission (8)**: Emergency/elective, timing, insurance
- **ICU Units (6)**: Care unit types, transfers
- **Vitals (11)**: HR, BP, RR, temp, SpO2 (avg/std)
- **Labs (8)**: Creatinine, glucose, electrolytes, blood counts
- **Diagnoses (4)**: Total count, sepsis, respiratory failure

## 🤖 Models & Results
- **Linear Regression**: 
- **Random Forest**: 

## ☁️ Infrastructure
- **GCP Dataproc**: 1x Master and 2x Workers, n2-standard-4  (12 vCPUs, 48GB RAM, 400GB Disk Storage)
- **Optimizations**: Smart sampling, aggressive filtering, 80/20 split





## Things to add to the report:

* justify why each column is included
* tune the model's hyperparameters
* references and bibliography:
    *

## Import Libraries

In [194]:
# 📦 PySpark Core Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window


# 🔢 Data Processing & Feature Engineering
from pyspark.ml.feature import (
    VectorAssembler,
    StandardScaler,
    StringIndexer,
    MinMaxScaler,
    Imputer
)
from pyspark.ml.functions import vector_to_array


# 🤖 Machine Learning Models
from pyspark.ml.regression import (
    RandomForestRegressor,
    LinearRegression
    # GBTRegressor
)


# 📊 Model Evaluation & Tuning
from pyspark.ml.evaluation import RegressionEvaluator
# from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml import Pipeline
import operator
import builtins


# ⏱️ Date/Time Utilities
from datetime import datetime, timedelta
import time


print("\n✅ All imports loaded successfully!")
print(f"⏰ Notebook started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")


✅ All imports loaded successfully!
⏰ Notebook started at: 2025-06-05 13:58:26



## Setup Spark Session

In [195]:
spark = SparkSession.builder \
    .appName("Forecast-LOS") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    \
    .config("spark.executor.memory", "5g") \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.instances", "2") \
    \
    .config("spark.driver.memory", "10g") \
    .config("spark.driver.cores", "3") \
    .config("spark.driver.maxResultSize", "2g") \
    \
    .config("spark.sql.shuffle.partitions", "32") \
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB") \
    .config("spark.sql.files.maxPartitionBytes", "128MB") \
    .config("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "32MB") \
    \
    .config("spark.network.timeout", "600s") \
    .config("spark.sql.broadcastTimeout", "300s") \
    .config("spark.rpc.askTimeout", "300s") \
    \
    .config("spark.executor.heartbeatInterval", "20s") \
    .config("spark.dynamicAllocation.enabled", "false") \
    \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.sql.adaptive.localShuffleReader.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "128MB") \
    \
    .config("spark.executor.memoryOffHeap.enabled", "true") \
    .config("spark.executor.memoryOffHeap.size", "1g") \
    \
    .getOrCreate()
print("✅ Spark session created successfully!")
print(f"📊 Spark Version: {spark.version}")
print(f"🔧 Application Name: {spark.sparkContext.appName}")
print(f"💾 Available cores: {spark.sparkContext.defaultParallelism}")
print(f"\n⏰ Spark session initialised at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Spark session created successfully!
📊 Spark Version: 4.0.0
🔧 Application Name: Forecast-LOS
💾 Available cores: 8

⏰ Spark session initialised at: 2025-06-05 13:58:26


# Load Data

Strategy: Pre-filter CHARTEVENTS to find ICU stays with required vital signs, then efficiently load all tables using broadcast joins and lookup tables.

Key Steps:

- Filter for ICU stays with ≥1 of 6 vital signs (HR, BP, RR, Temp, SpO2)
- Create lookup tables for ICUSTAY_ID, HADM_ID, SUBJECT_ID
- Load all tables with pre-filtering using broadcast joins
- Convert large files to Parquet for performance

Result: Memory-efficient loading of only relevant data with quality assurance that all ICU stays have vital signs measurements.
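To make the lookup-table idea concrete, here is a minimal plain-Python sketch (no Spark required). It assumes tiny hypothetical tables and only illustrates the semantics: an inner join against a table of distinct keys behaves like a membership filter on the large table. It says nothing about how Spark executes the broadcast join.

```python
# Hypothetical miniature tables; real MIMIC-III rows have many more columns.
icustays = [
    {"ICUSTAY_ID": 1, "HADM_ID": 100, "SUBJECT_ID": 10},
    {"ICUSTAY_ID": 2, "HADM_ID": 101, "SUBJECT_ID": 11},
]
chartevents = [
    {"ICUSTAY_ID": 1, "ITEMID": 211, "VALUENUM": 80.0},
    {"ICUSTAY_ID": 3, "ITEMID": 211, "VALUENUM": 95.0},  # stay not in ICUSTAYS
    {"ICUSTAY_ID": 2, "ITEMID": 618, "VALUENUM": 18.0},
]

# The "lookup table" is the distinct set of keys taken from the small table.
icu_lookup = {row["ICUSTAY_ID"] for row in icustays}

# An inner join against a key-only lookup keeps exactly the matching rows.
filtered_events = [row for row in chartevents if row["ICUSTAY_ID"] in icu_lookup]

print(f"Kept {len(filtered_events)} of {len(chartevents)} chart events")
```

Because the lookup holds only distinct IDs, the join cannot duplicate rows of the large table, which is why it is safe to use purely as a filter.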

In [196]:
#Configuration flags
#SAMPLE_ENABLE = False
#SAMPLE_SIZE = 20000
#MIMIC_PATH = "gs://dataproc-staging-europe-west2-851143487985-hir6gfre/mimic-data"
#
#
#
#print("🏥 Loading MIMIC-III data...")
#
## Step 1: First, find ICUSTAY_IDs that have ALL required vital signs
#print("📂 Loading CHARTEVENTS...")
#
#try:
#    chartevents_df = spark.read.parquet(f"{MIMIC_PATH}/CHARTEVENTS.parquet")
#    print("✅ Loaded CHARTEVENTS from parquet")
#except:
#    print("📄 Converting CHARTEVENTS.csv.gz to parquet...")
#    chartevents_csv = spark.read.option("header", "true").option("inferSchema", "false").csv(f"{MIMIC_PATH}/CHARTEVENTS.csv.gz")
#    chartevents_csv.write.mode("overwrite").parquet(f"{MIMIC_PATH}/CHARTEVENTS.parquet")
#    chartevents_df = spark.read.parquet(f"{MIMIC_PATH}/CHARTEVENTS.parquet")
#    print("✅ Converted and loaded CHARTEVENTS")
#
#
#
## Step 2: Load ICUSTAYS 
#print("\n📂 Loading and filtering ICUSTAYS...")
#icustays_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{MIMIC_PATH}/ICUSTAYS.csv.gz")
#
#
#
## Step 3: Apply sampling if enabled
#if SAMPLE_ENABLE:
#    print(f"🎯 Sampling {SAMPLE_SIZE} ICU stays...")
#    icustays_df = icustays_df.limit(SAMPLE_SIZE)
#    icustays_df.cache()
#    actual_sample_size = icustays_df.count()
#    print(f"✅ Final sample: {actual_sample_size} ICU stays")
#else:
#    icustays_df.cache()
#    actual_sample_size = icustays_df.count()
#
#    
#    
## Step 4: Create efficient lookup tables
#print("📋 Creating ID lookup tables...")
#icu_lookup = icustays_df.select("ICUSTAY_ID").distinct().cache()
#hadm_lookup = icustays_df.select("HADM_ID").distinct().cache()
#subject_lookup = icustays_df.select("SUBJECT_ID").distinct().cache()
#
#icu_lookup.count()  # Trigger caching
#hadm_lookup.count()
#subject_lookup.count()
#
## Step 5: Load other tables with optimized joins
#print("📂 Loading PATIENTS table...")
#patients_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{MIMIC_PATH}/PATIENTS.csv.gz")
#patients_df = patients_df.join(broadcast(subject_lookup), "SUBJECT_ID", "inner")
#
#print("📂 Loading ADMISSIONS table...")
#admissions_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{MIMIC_PATH}/ADMISSIONS.csv.gz")
#admissions_df = admissions_df.join(broadcast(hadm_lookup), "HADM_ID", "inner")
#
#print("📂 Loading DIAGNOSES_ICD table...")
#diagnoses_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{MIMIC_PATH}/DIAGNOSES_ICD.csv.gz")
#diagnoses_df = diagnoses_df.join(broadcast(hadm_lookup), "HADM_ID", "inner")
#
## Step 6: Load and filter CHARTEVENTS efficiently
#print("📂 Loading CHARTEVENTS table... [FILTERING BY ICUSTAY_ID]")
#chartevents_df = chartevents_df \
#    .select("ICUSTAY_ID", "CHARTTIME", "ITEMID", "VALUE", "VALUEUOM", "VALUENUM") \
#    .join(broadcast(icu_lookup), "ICUSTAY_ID", "inner")
#
## Step 7: Load LABEVENTS
#print("📂 Loading LABEVENTS table... [FILTERING BY HADM_ID]")
#try:
#    labevents_df = spark.read.parquet(f"{MIMIC_PATH}/LABEVENTS.parquet")
#except:
#    print("📄 Converting LABEVENTS.csv.gz to parquet...")
#    labevents_csv = spark.read.option("header", "true").option("inferSchema", "false").csv(f"{MIMIC_PATH}/LABEVENTS.csv.gz")
#    labevents_csv.write.mode("overwrite").parquet(f"{MIMIC_PATH}/LABEVENTS.parquet")
#    labevents_df = spark.read.parquet(f"{MIMIC_PATH}/LABEVENTS.parquet")
#
#labevents_df = labevents_df.join(broadcast(hadm_lookup), "HADM_ID", "inner")
#
## Final summary
#print("\n✅ Data loading complete!")
#print(f"📊 ICUSTAYS: {icustays_df.count():,} rows")
#print(f"📊 PATIENTS: {patients_df.count():,} rows") 
#print(f"📊 ADMISSIONS: {admissions_df.count():,} rows")
#print(f"📊 DIAGNOSES_ICD: {diagnoses_df.count():,} rows")
#print(f"📊 CHARTEVENTS (filtered): {chartevents_df.count():,} rows")
#print(f"📊 LABEVENTS (filtered): {labevents_df.count():,} rows")
#print(f"\n⏰ Data loaded at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

REMOVE THIS BEFORE SUBMITTING: THIS CELL IS FOR RUNNING LOCALLY!!!

In [197]:
#Configuration flags
SAMPLE_ENABLE = False
SAMPLE_SIZE = 20000
MIMIC_PATH = "mimic-db-short"



print("🏥 Loading MIMIC-III data...")

# Step 1: Load CHARTEVENTS (local CSV sample; filtering by ICUSTAY_ID happens in Step 6)
print("📂 Loading CHARTEVENTS...")


chartevents_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{MIMIC_PATH}/CHARTEVENTS.csv")
print("✅ Loaded CHARTEVENTS from parquet")




# Step 2: Load ICUSTAYS 
print("\n📂 Loading and filtering ICUSTAYS...")
icustays_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{MIMIC_PATH}/ICUSTAYS.csv")



# Step 3: Apply sampling if enabled
if SAMPLE_ENABLE:
    print(f"🎯 Sampling {SAMPLE_SIZE} ICU stays...")
    icustays_df = icustays_df.limit(SAMPLE_SIZE)
    icustays_df.cache()
    actual_sample_size = icustays_df.count()
    print(f"✅ Final sample: {actual_sample_size} ICU stays")
else:
    icustays_df.cache()
    actual_sample_size = icustays_df.count()

    
    
# Step 4: Create efficient lookup tables
print("📋 Creating ID lookup tables...")
icu_lookup = icustays_df.select("ICUSTAY_ID").distinct().cache()
hadm_lookup = icustays_df.select("HADM_ID").distinct().cache()
subject_lookup = icustays_df.select("SUBJECT_ID").distinct().cache()

icu_lookup.count()  # Trigger caching
hadm_lookup.count()
subject_lookup.count()

# Step 5: Load other tables with optimized joins
print("📂 Loading PATIENTS table...")
patients_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{MIMIC_PATH}/PATIENTS.csv")
patients_df = patients_df.join(broadcast(subject_lookup), "SUBJECT_ID", "inner")

print("📂 Loading ADMISSIONS table...")
admissions_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{MIMIC_PATH}/ADMISSIONS.csv")
admissions_df = admissions_df.join(broadcast(hadm_lookup), "HADM_ID", "inner")

print("📂 Loading DIAGNOSES_ICD table...")
diagnoses_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{MIMIC_PATH}/DIAGNOSES_ICD.csv")
diagnoses_df = diagnoses_df.join(broadcast(hadm_lookup), "HADM_ID", "inner")

# Step 6: Load and filter CHARTEVENTS efficiently
print("📂 Loading CHARTEVENTS table... [FILTERING BY ICUSTAY_ID]")
chartevents_df = chartevents_df \
    .select("ICUSTAY_ID", "CHARTTIME", "ITEMID", "VALUE", "VALUEUOM", "VALUENUM") \
    .join(broadcast(icu_lookup), "ICUSTAY_ID", "inner")

# Step 7: Load LABEVENTS
print("📂 Loading LABEVENTS table... [FILTERING BY HADM_ID]")

labevents_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{MIMIC_PATH}/LABEVENTS.csv")


labevents_df = labevents_df.join(broadcast(hadm_lookup), "HADM_ID", "inner")

# Final summary
print("\n✅ Data loading complete!")
print(f"📊 ICUSTAYS: {icustays_df.count():,} rows")
print(f"📊 PATIENTS: {patients_df.count():,} rows") 
print(f"📊 ADMISSIONS: {admissions_df.count():,} rows")
print(f"📊 DIAGNOSES_ICD: {diagnoses_df.count():,} rows")
print(f"📊 CHARTEVENTS (filtered): {chartevents_df.count():,} rows")
print(f"📊 LABEVENTS (filtered): {labevents_df.count():,} rows")
print(f"\n⏰ Data loaded at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

🏥 Loading MIMIC-III data...
📂 Loading CHARTEVENTS...
✅ Loaded CHARTEVENTS from parquet

📂 Loading and filtering ICUSTAYS...
📋 Creating ID lookup tables...


25/06/05 13:58:27 WARN CacheManager: Asked to cache already cached data.
25/06/05 13:58:27 WARN CacheManager: Asked to cache already cached data.
25/06/05 13:58:27 WARN CacheManager: Asked to cache already cached data.
25/06/05 13:58:27 WARN CacheManager: Asked to cache already cached data.


📂 Loading PATIENTS table...
📂 Loading ADMISSIONS table...
📂 Loading DIAGNOSES_ICD table...
📂 Loading CHARTEVENTS table... [FILTERING BY ICUSTAY_ID]
📂 Loading LABEVENTS table... [FILTERING BY HADM_ID]

✅ Data loading complete!
📊 ICUSTAYS: 20 rows
📊 PATIENTS: 20 rows
📊 ADMISSIONS: 20 rows
📊 DIAGNOSES_ICD: 212 rows
📊 CHARTEVENTS (filtered): 57,973 rows
📊 LABEVENTS (filtered): 5,895 rows

⏰ Data loaded at: 2025-06-05 13:58:28


# Features Engineering



## Extracting Data From ICUSTAYS

**Purpose**: Create comprehensive ICU dataset by joining ICU stays with patient demographics and admission details.

**Key Features**:
- **Target Variable**: ICU_LOS_DAYS (length of stay)
- **Demographics**: Age (18-80), gender, ethnicity
- **Clinical**: Care units, admission type/location, insurance
- **Outcomes**: Hospital/patient death flags
- **Identifiers**: ICUSTAY_ID, SUBJECT_ID, HADM_ID

**Age Filter**: Adults only (18-80 years) to exclude pediatric/very elderly edge cases.

**Alive Filter**: Only include patients with no recorded death (`EXPIRE_FLAG` = 0 in PATIENTS).

**LOS Filter**: Keep only LOS values within a range that excludes outliers.

**Result**: Clean base dataset ready for vital signs feature engineering.
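The age rule can be checked by hand with a small plain-Python sketch. It mirrors the day-difference divided by 365.25 approach used in the notebook cell; the helper names `age_at_admission` and `in_adult_range` and the example dates are hypothetical.

```python
import math
from datetime import date


def age_at_admission(dob: date, intime: date) -> int:
    """Whole years between date of birth and ICU admission (days / 365.25, floored)."""
    return math.floor((intime - dob).days / 365.25)


def in_adult_range(age: int, low: int = 18, high: int = 80) -> bool:
    """Inclusive range check, matching the semantics of Spark's between()."""
    return low <= age <= high


# Hypothetical patient: born 2100-01-01, admitted to the ICU on 2150-06-01.
age = age_at_admission(date(2100, 1, 1), date(2150, 6, 1))
print(age, in_adult_range(age))
```

Dividing by 365.25 is an approximation that averages leap years, so the result can be off by one on the days immediately around a birthday.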

In [198]:
print("📊 Step 1: Creating base ICU dataset with patient demographics...")

base_icu_df = icustays_df.alias("icu") \
    .join(patients_df.alias("pat"), "SUBJECT_ID", "inner") \
    .join(admissions_df.alias("adm"), ["SUBJECT_ID", "HADM_ID"], "inner") \
    .select(
        # ICU stay identifiers
        col("icu.ICUSTAY_ID"),
        col("icu.SUBJECT_ID"), 
        col("icu.HADM_ID"),
        
        # Target variable - Length of Stay in ICU (days)
        col("icu.LOS").alias("ICU_LOS_DAYS"),
        
        # ICU characteristics
        col("icu.FIRST_CAREUNIT"),
        col("icu.LAST_CAREUNIT"), 
        col("icu.INTIME").alias("ICU_INTIME"),
        col("icu.OUTTIME").alias("ICU_OUTTIME"),
        
        # Patient demographics
        col("pat.GENDER"),
        col("pat.EXPIRE_FLAG").alias("PATIENT_DIED"),
        col("pat.DOB"),
        
        # Admission details
        col("adm.ADMITTIME"),
        col("adm.DISCHTIME"), 
        col("adm.ADMISSION_TYPE"),
        col("adm.ADMISSION_LOCATION"),
        col("adm.INSURANCE"),
        col("adm.ETHNICITY"),
        col("adm.MARITAL_STATUS"),
        col("adm.RELIGION"),
        col("adm.HOSPITAL_EXPIRE_FLAG").alias("HOSPITAL_DEATH"),
        col("adm.DIAGNOSIS").alias("ADMISSION_DIAGNOSIS")
    )

# Calculate age at ICU admission, then keep adults (18-80) with no recorded death
base_icu_df = base_icu_df \
    .withColumn("AGE_AT_ICU_ADMISSION",
                floor(datediff(col("ICU_INTIME"), col("DOB")) / 365.25)) \
    .filter(col("AGE_AT_ICU_ADMISSION").between(18, 80)) \
    .filter(col("PATIENT_DIED") == 0)


print("✅ Created base ICU dataset!")

📊 Step 1: Creating base ICU dataset with patient demographics...
✅ Created base ICU dataset!


In [199]:
print("\n📈 ICU Length of Stay Statistics (Days):")
base_icu_df.select("ICU_LOS_DAYS").describe().show()


📈 ICU Length of Stay Statistics (Days):
+-------+-----------------+
|summary|     ICU_LOS_DAYS|
+-------+-----------------+
|  count|               12|
|   mean|2.360116666666667|
| stddev|2.264151572918985|
|    min|            0.848|
|    max|           8.9163|
+-------+-----------------+



We kept every ICU stay with a duration (LOS) between 0.0 and 9.1 days, which we consider normal lengths given the following LOS distribution:

| Statistic                | Value (days)                                    |
| ------------------------ | ----------------------------------------------- |
| **Minimum**              | 0.0 (can be admission + discharge on same day)  |
| **25th percentile (Q1)** | \~1.1                                           |
| **Median (Q2)**          | \~2.1                                           |
| **75th percentile (Q3)** | \~4.3                                           |
| **Maximum**              | \~88 (but can go slightly higher in edge cases) |
| **Mean**                 | \~3.3–3.5                                       |

Using interquartile range (IQR) method:

* IQR = Q3 - Q1 = 4.3 - 1.1 = ~3.2

* Upper Bound for outliers = Q3 + 1.5 × IQR ≈ 4.3 + 4.8 = ~9.1 days

* Lower Bound = Q1 - 1.5 × IQR ≈ 1.1 - 4.8 = -3.7, which is ignored since LOS can’t be negative

So:

* Typical ICU LOS: 1.1 to 4.3 days

* Outliers: ICU stays longer than ~9.1 days
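The IQR arithmetic above can be reproduced in a few lines of plain Python. The quartiles come from the table; the sample LOS values are illustrative (two are taken from the `describe()` output, the 12.0-day stay is hypothetical).

```python
# Quartiles quoted in the table above (days).
q1, q3 = 1.1, 4.3

iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr
# A negative lower bound is clamped to 0 because LOS cannot be negative.
lower_bound = max(0.0, q1 - 1.5 * iqr)

print(f"IQR = {iqr:.1f}, bounds = [{lower_bound:.1f}, {upper_bound:.1f}] days")

# Illustrative LOS values; the 12.0-day stay is a made-up outlier.
los_values = [0.848, 2.36, 8.9163, 12.0]
kept = [los for los in los_values if lower_bound <= los <= upper_bound]
print(kept)
```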

In [200]:
# Print initial dataset size
print(f"Number of rows before removing LOS outliers: {base_icu_df.count()}")

print("📊 Cleaning target variable...")

# Filter to keep only records with ICU_LOS_DAYS between 0 and 9.1 days
base_icu_df = base_icu_df.filter(
    (col("ICU_LOS_DAYS") >= 0.0) & 
    (col("ICU_LOS_DAYS") <= 9.1)
).cache()

print("✅ Base ICU Dataset - Outliers Removed")

# Print filtered dataset size
print(f"Number of rows after removing LOS outliers: {base_icu_df.count()}")

Number of rows before removing LOS outliers: 12
📊 Cleaning target variable...
✅ Base ICU Dataset - Outliers Removed
Number of rows after removing LOS outliers: 12


25/06/05 13:58:28 WARN CacheManager: Asked to cache already cached data.


## Extracting Categorical Features

**Features Created**:
- **GENDER_BINARY**: Male = 1, Female = 0
- **CAME_FROM_ER**: Admission location contains "EMERGENCY" = 1, otherwise 0
- **HAS_INSURANCE**: Medicare = 1, other = 0
- **ADMISSION_TYPE_ENCODED**: Emergency=1, Elective=2, Urgent=3, Other=0
- **ETHNICITY_ENCODED**: White=1, Black=2, Hispanic=3, Asian=4, Other=5
- **MARITAL_STATUS_ENCODED**: Married=1, Single=2, Divorced=3, Widowed=4, Separated=5, Life Partner=6, Other=0
- **RELIGION_ENCODED**: Catholic=1, Protestant=2, Jewish=3, Other=0



**Result**: Categorical variables converted to numerical format for ML models.
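The substring-based encodings behave like an ordered rule list: the first matching keyword wins, and anything unmatched (including a missing value) falls through to the default. A minimal plain-Python sketch of that logic for ethnicity follows; the helper `encode_by_substring` is hypothetical and the input strings are illustrative.

```python
from typing import List, Optional, Tuple

# Ordered (keyword, code) rules, in the same order as the ETHNICITY_ENCODED mapping.
ETHNICITY_RULES: List[Tuple[str, int]] = [
    ("WHITE", 1),
    ("BLACK", 2),
    ("HISPANIC", 3),
    ("ASIAN", 4),
]


def encode_by_substring(value: Optional[str], rules: List[Tuple[str, int]], default: int) -> int:
    """Return the code of the first keyword contained in value, else the default."""
    if value is None:
        return default  # a missing value falls through to the default branch
    for keyword, code in rules:
        if keyword in value:  # case-sensitive substring match
            return code
    return default


for raw in ["WHITE - RUSSIAN", "BLACK/AFRICAN AMERICAN", "UNKNOWN/NOT SPECIFIED", None]:
    print(raw, "->", encode_by_substring(raw, ETHNICITY_RULES, default=5))
```

Rule order only matters when a value could match more than one keyword, so it is worth keeping the most specific keywords first if the mapping is extended.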

In [201]:
print("📊 Step 2: Engineering categorical features...")
base_icu_df = base_icu_df \
    .withColumn("GENDER_BINARY", when(col("GENDER") == "M", 1).otherwise(0)) \
    .withColumn("CAME_FROM_ER", when(col("ADMISSION_LOCATION").contains("EMERGENCY"), 1).otherwise(0)) \
    .withColumn("HAS_INSURANCE", when(col("INSURANCE") == "Medicare", 1).otherwise(0)) \
    .withColumn("ADMISSION_TYPE_ENCODED", 
                when(col("ADMISSION_TYPE") == "EMERGENCY", 1)
                .when(col("ADMISSION_TYPE") == "ELECTIVE", 2)
                .when(col("ADMISSION_TYPE") == "URGENT", 3)
                .otherwise(0)) \
    .withColumn("ETHNICITY_ENCODED",
                when(col("ETHNICITY").contains("WHITE"), 1)
                .when(col("ETHNICITY").contains("BLACK"), 2)
                .when(col("ETHNICITY").contains("HISPANIC"), 3)
                .when(col("ETHNICITY").contains("ASIAN"), 4)
                .otherwise(5)) \
    .withColumn("MARITAL_STATUS_ENCODED",
                when(col("MARITAL_STATUS") == "MARRIED", 1)
                .when(col("MARITAL_STATUS") == "SINGLE", 2)
                .when(col("MARITAL_STATUS") == "DIVORCED", 3)
                .when(col("MARITAL_STATUS") == "WIDOWED", 4)
                .when(col("MARITAL_STATUS") == "SEPARATED", 5)
                .when(col("MARITAL_STATUS") == "LIFE PARTNER", 6)
                .otherwise(0)) \
    .withColumn("RELIGION_ENCODED",
                when(col("RELIGION").contains("CATHOLIC"), 1)
                .when(col("RELIGION").contains("PROTESTANT"), 2)
                .when(col("RELIGION").contains("JEWISH"), 3)
                .otherwise(0))

print("✅ Base ICU Dataset - Categorical Features")


📊 Step 2: Engineering categorical features...
✅ Base ICU Dataset - Categorical Features


## Extracting ICU Unit Types

**Purpose**: Create categorical features for ICU unit types and transfers.

**Features Created**:
- **FIRST_UNIT_ENCODED**: Numerical encoding of ICU units
    - MICU (Medical) = 1
    - SICU (Surgical) = 2
    - CSRU (Cardiac Surgery) = 3
    - CCU (Coronary Care) = 4
    - TSICU (Trauma Surgical) = 5
    - Other = 0
- **CHANGED_ICU_UNIT**: Binary flag (1 if patient transferred between units)

**Clinical Significance**: Different ICU types have varying complexity and typical LOS patterns. Unit transfers often indicate complications.

**Result**: Enhanced dataset with ICU unit complexity and transfer indicators.

In [202]:
print("📊 Step 3: Creating ICU unit type features...")

base_icu_df = base_icu_df \
    .withColumn("FIRST_UNIT_ENCODED", 
                when(col("FIRST_CAREUNIT") == "MICU", 1)
                .when(col("FIRST_CAREUNIT") == "SICU", 2)
                .when(col("FIRST_CAREUNIT") == "CSRU", 3)
                .when(col("FIRST_CAREUNIT") == "CCU", 4)
                .when(col("FIRST_CAREUNIT") == "TSICU", 5)
                .otherwise(0)) \
    .withColumn("CHANGED_ICU_UNIT", 
                when(col("FIRST_CAREUNIT") != col("LAST_CAREUNIT"), 1).otherwise(0))


print("✅ Base ICU Dataset - Unit Type Features")

📊 Step 3: Creating ICU unit type features...
✅ Base ICU Dataset - Unit Type Features


## Extracting Time-based Features

**Action**: Filter out invalid records where INTIME >= OUTTIME.


In [203]:
print("📊 Step 4: Creating time-based features...")
base_icu_df = base_icu_df \
    .filter(col("ICU_INTIME") < col("ICU_OUTTIME"))
print("✅ Base ICU Dataset - Time Based Features")

📊 Step 4: Creating time-based features...
✅ Base ICU Dataset - Time Based Features


Select the useful columns from the base DataFrame.

In [204]:
base_icu_df.show()


+----------+----------+-------+------------+--------------+-------------+-------------------+-------------------+------+------------+-------------------+-------------------+-------------------+--------------+--------------------+---------+--------------------+--------------+-----------------+--------------+--------------------+--------------------+-------------+------------+-------------+----------------------+-----------------+----------------------+----------------+------------------+----------------+
|ICUSTAY_ID|SUBJECT_ID|HADM_ID|ICU_LOS_DAYS|FIRST_CAREUNIT|LAST_CAREUNIT|         ICU_INTIME|        ICU_OUTTIME|GENDER|PATIENT_DIED|                DOB|          ADMITTIME|          DISCHTIME|ADMISSION_TYPE|  ADMISSION_LOCATION|INSURANCE|           ETHNICITY|MARITAL_STATUS|         RELIGION|HOSPITAL_DEATH| ADMISSION_DIAGNOSIS|AGE_AT_ICU_ADMISSION|GENDER_BINARY|CAME_FROM_ER|HAS_INSURANCE|ADMISSION_TYPE_ENCODED|ETHNICITY_ENCODED|MARITAL_STATUS_ENCODED|RELIGION_ENCODED|FIRST_UNIT_ENCODED|

In [205]:
print("📊 Step 5: Dropping useless columns...")

# Columns to drop: raw fields that are already encoded or no longer needed
drop_cols = [
    "FIRST_CAREUNIT",
    "LAST_CAREUNIT",
    "GENDER",
    "PATIENT_DIED",
    "DOB",
    "ADMITTIME",
    "DISCHTIME",
    "ADMISSION_TYPE",
    "ADMISSION_LOCATION",
    "INSURANCE",
    "ETHNICITY",
    "MARITAL_STATUS",
    "RELIGION",
    "HOSPITAL_DEATH",
    "ADMISSION_DIAGNOSIS"
]

# Keep all columns except those in drop_cols
base_icu_df = base_icu_df.drop(*drop_cols)

print("✅ Base ICU Dataset - Finalized")
base_icu_df.show(5)  # Showing first 5 rows for brevity

📊 Step 5: Dropping useless columns...
✅ Base ICU Dataset - Finalized
+----------+----------+-------+------------+-------------------+-------------------+--------------------+-------------+------------+-------------+----------------------+-----------------+----------------------+----------------+------------------+----------------+
|ICUSTAY_ID|SUBJECT_ID|HADM_ID|ICU_LOS_DAYS|         ICU_INTIME|        ICU_OUTTIME|AGE_AT_ICU_ADMISSION|GENDER_BINARY|CAME_FROM_ER|HAS_INSURANCE|ADMISSION_TYPE_ENCODED|ETHNICITY_ENCODED|MARITAL_STATUS_ENCODED|RELIGION_ENCODED|FIRST_UNIT_ENCODED|CHANGED_ICU_UNIT|
+----------+----------+-------+------------+-------------------+-------------------+--------------------+-------------+------------+-------------+----------------------+-----------------+----------------------+----------------+------------------+----------------+
|    231977|      8470| 184688|      0.9792|2174-09-01 18:14:58|2174-09-02 17:45:00|                  30|            0|           0|       

## Extracting Clinical Events

**Purpose**: Extract top 20 most common CHARTEVENTS as features for ML models.

**Process**:
1. **Identify**: Find 20 most frequent CHARTEVENTS (typically vital signs)
2. **Calculate**: Average value of each test in first 24 hours of ICU stay
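3. **Handle Missing**: Missing values are meant to be encoded as **-1** (not null) for ML compatibility; the cell below leaves them as NULL after the left join, so the fill has to happen in a later step

**Time Window**: First 24 hours after ICU admission (INTIME to INTIME + 24h)

**Result**: Up to 20 chart-event features (one 24-hour average per ITEMID), with NULL where an ICU stay has no measurement for that item.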
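The windowed averaging can be illustrated without Spark. The sketch below assumes events arrive as `(ICUSTAY_ID, ITEMID, CHARTTIME, VALUENUM)` tuples and that admission times are held in a dictionary; the function name `window_averages` and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def window_averages(events, intimes, hours_before=0, hours_after=24):
    """Average VALUENUM per (ICUSTAY_ID, ITEMID) inside a window around ICU_INTIME."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for stay_id, itemid, charttime, value in events:
        if stay_id not in intimes or value is None:
            continue  # behaves like the inner join plus the isNotNull filter
        start = intimes[stay_id] - timedelta(hours=hours_before)
        end = intimes[stay_id] + timedelta(hours=hours_after)
        if start <= charttime <= end:  # inclusive on both ends
            totals[(stay_id, itemid)][0] += value
            totals[(stay_id, itemid)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical stay admitted at 08:00; ITEMIDs taken from the top-20 list.
intimes = {1: datetime(2150, 1, 1, 8, 0)}
events = [
    (1, 220045, datetime(2150, 1, 1, 9, 0), 80.0),
    (1, 220045, datetime(2150, 1, 1, 20, 0), 90.0),
    (1, 220045, datetime(2150, 1, 3, 9, 0), 120.0),  # outside the 24h window
    (1, 220277, datetime(2150, 1, 1, 9, 0), None),   # missing value is skipped
]
print(window_averages(events, intimes))
```

Items with no valid measurement in the window simply do not appear in the result, which corresponds to the NULLs produced by the left join in the notebook.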


In [206]:
print("📊 Identifying top 20 most frequent tests from CHARTEVENTS...")


# Get frequency count of each ITEMID in CHARTEVENTS
itemid_counts = chartevents_df \
    .filter(col("ICUSTAY_ID").isNotNull()) \
    .filter(col("VALUENUM").isNotNull()) \
    .filter(col("CHARTTIME").isNotNull()) \
    .groupBy("ITEMID") \
    .count() \
    .orderBy(col("count").desc()) \
    .limit(20) \
    .collect()

# Create mapping dictionary for top 20 items
top_20_items = {row["ITEMID"]: f"VITAL_{row['ITEMID']}" for row in itemid_counts}
print(f"🎯 Top 20 chart items selected: {top_20_items}")

print("📊 Filtering CHARTEVENTS for top 20 items...")

# Keep only complete rows for the top 20 items that fall inside the ICU stay
chartevents_top20 = chartevents_df \
    .filter(col("ITEMID").isin(list(top_20_items.keys()))) \
    .filter(col("VALUENUM").isNotNull()) \
    .filter(col("ICUSTAY_ID").isNotNull()) \
    .filter(col("CHARTTIME").isNotNull()) \
    .join(base_icu_df.select("ICUSTAY_ID", "ICU_INTIME", "ICU_OUTTIME"), "ICUSTAY_ID", "inner") \
    .filter(col("CHARTTIME").between(col("ICU_INTIME"), col("ICU_OUTTIME"))) \
    .select("ICUSTAY_ID", "ITEMID", "CHARTTIME", "VALUENUM")

# Process first 24 hours
vitals_24h_top20 = chartevents_top20.alias("ce") \
    .join(base_icu_df.select("ICUSTAY_ID", "ICU_INTIME"), "ICUSTAY_ID", "inner") \
    .filter(
        col("ce.CHARTTIME").between(
            col("ICU_INTIME"), 
            col("ICU_INTIME") + expr("INTERVAL 24 HOURS")
        )
    )

print("📊 Calculating aggregates for top 20 vitals...")

# Initialize with ICUSTAY_ID
vitals_features_top20 = base_icu_df.select("ICUSTAY_ID")

# Process each vital sign
for itemid, name in top_20_items.items():
    #print(f"Processing {name} (ITEMID={itemid})...")
    
    vital_stats = vitals_24h_top20 \
        .filter(col("ITEMID") == itemid) \
        .groupBy("ICUSTAY_ID") \
        .agg(avg("VALUENUM").alias(f"{name}_AVG"))
    
    # Left join (without filling NULLs yet)
    vitals_features_top20 = vitals_features_top20.join(vital_stats, "ICUSTAY_ID", "left")

# Cleanup
chartevents_df.unpersist()
vitals_24h_top20.unpersist()

# Note: NULLs remain here for stays that lack a given measurement
print(f"✅ Created {len(top_20_items)} features from top 20 vital signs")
vitals_features_top20.show(5)

📊 Identifying top 20 most frequent tests from CHARTEVENTS...
🎯 Top 20 chart items selected: {220045: 'VITAL_220045', 220277: 'VITAL_220277', 220210: 'VITAL_220210', 220181: 'VITAL_220181', 220179: 'VITAL_220179', 220180: 'VITAL_220180', 211: 'VITAL_211', 742: 'VITAL_742', 618: 'VITAL_618', 646: 'VITAL_646', 223901: 'VITAL_223901', 220739: 'VITAL_220739', 223900: 'VITAL_223900', 220052: 'VITAL_220052', 220050: 'VITAL_220050', 220051: 'VITAL_220051', 223753: 'VITAL_223753', 8441: 'VITAL_8441', 455: 'VITAL_455', 456: 'VITAL_456'}
📊 Filtering CHARTEVENTS for top 20 items...
📊 Calculating aggregates for top 20 vitals...
✅ Created 20 features from top 20 vital signs


                                                                                

+----------+-----------------+----------------+------------------+-----------------+------------------+-----------------+-------------+-------------+-------------+-------------+-----------------+------------------+----------------+-----------------+------------------+-----------------+-----------------+--------------+-------------+-------------+
|ICUSTAY_ID| VITAL_220045_AVG|VITAL_220277_AVG|  VITAL_220210_AVG| VITAL_220181_AVG|  VITAL_220179_AVG| VITAL_220180_AVG|VITAL_211_AVG|VITAL_742_AVG|VITAL_618_AVG|VITAL_646_AVG| VITAL_223901_AVG|  VITAL_220739_AVG|VITAL_223900_AVG| VITAL_220052_AVG|  VITAL_220050_AVG| VITAL_220051_AVG| VITAL_223753_AVG|VITAL_8441_AVG|VITAL_455_AVG|VITAL_456_AVG|
+----------+-----------------+----------------+------------------+-----------------+------------------+-----------------+-------------+-------------+-------------+-------------+-----------------+------------------+----------------+-----------------+------------------+-----------------+-----------------+

## Extracting Laboratory Events

**Purpose**: Extract top 20 most common lab tests as features for ML models.

**Process**:
1. **Identify**: Find 20 most frequent LABEVENTS (blood tests, chemistry panels)
2. **Time Window**: 6 hours before ICU admission + first 24 hours in ICU (30h total)
3. **Calculate**: Average value of each lab test within the 30-hour window

**Time Range**: ICU_INTIME - 6h to ICU_INTIME + 24h

**Result**: 20 lab test features capturing pre-ICU and early ICU clinical status. The cell below leaves missing values as NULL; the -1 encoding is not applied at this step.
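The -1 encoding for missing values can be sketched in plain Python as a long-to-wide conversion with a sentinel default. This is an illustration under assumptions, not the notebook's implementation: the helper `to_wide_row`, the `MISSING` constant and the sample average are hypothetical, while the column naming follows the `LAB_<ITEMID>_AVG` pattern used in the notebook.

```python
MISSING = -1.0  # sentinel used in place of NULL


def to_wide_row(stay_id, averages, itemids, prefix="LAB"):
    """Build one feature row per ICU stay, filling absent items with the sentinel."""
    row = {"ICUSTAY_ID": stay_id}
    for itemid in itemids:
        row[f"{prefix}_{itemid}_AVG"] = averages.get((stay_id, itemid), MISSING)
    return row


# Hypothetical averages keyed by (ICUSTAY_ID, ITEMID); only one lab was measured.
averages = {(1, 51221): 30.5}
row = to_wide_row(1, averages, itemids=[51221, 50983])
print(row)
```

A sentinel such as -1 only works if no real measurement can take that value; the `VALUENUM > 0` filter in the lab cell makes that hold for these features.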

In [207]:
print("\n🧪 Creating laboratory features from LABEVENTS...")

# Step 1: Identify top 20 most frequent lab items
print("📊 Identifying top 20 most frequent lab items...")
top_20_lab_items = labevents_df \
    .filter(col("HADM_ID").isin([row["HADM_ID"] for row in base_icu_df.select("HADM_ID").collect()])) \
    .filter(col("VALUENUM").isNotNull()) \
    .filter(col("VALUENUM") > 0) \
    .groupBy("ITEMID") \
    .count() \
    .orderBy(col("count").desc()) \
    .limit(20) \
    .collect()

# Create mapping dictionary with clean LAB_[ITEMID] format
lab_items = {row["ITEMID"]: f"LAB_{row['ITEMID']}" for row in top_20_lab_items}
print(f"🎯 Top 20 lab items selected: {list(lab_items.keys())}")

# Step 2: Keep lab events from 6 hours before to 24 hours after ICU admission
print("📊 Filtering LABEVENTS for top 20 items...")
labs_24h = labevents_df.alias("le") \
    .join(base_icu_df.select("ICUSTAY_ID", "HADM_ID", "ICU_INTIME"), "HADM_ID", "inner") \
    .filter(col("le.ITEMID").isin(list(lab_items.keys()))) \
    .filter(col("le.VALUENUM").isNotNull()) \
    .filter(col("le.VALUENUM") > 0) \
    .filter(
        col("le.CHARTTIME").between(
            col("ICU_INTIME") - expr("INTERVAL 6 HOURS"),  # Include pre-ICU labs
            col("ICU_INTIME") + expr("INTERVAL 24 HOURS")
        )
    )

# Step 3: Calculate lab statistics with clean column names
print("📊 Calculating laboratory statistics...")
labs_features = base_icu_df.select("ICUSTAY_ID")

for itemid, name in lab_items.items():
    item_stats = labs_24h \
        .filter(col("ITEMID") == itemid) \
        .groupBy("ICUSTAY_ID") \
        .agg(
            avg("VALUENUM").alias(f"{name}_AVG")  # Simple alias without coalesce in the name
        )
    
    labs_features = labs_features.join(item_stats, "ICUSTAY_ID", "left")

# Cleanup
labevents_df.unpersist()
labs_24h.unpersist()

print(f"✅ Created {len(lab_items)} lab features for {labs_features.count():,} ICU stays")

# Show sample of features with clean column names
print("📊 Sample features:")
labs_features.select(
    "ICUSTAY_ID",
    *[c for c in labs_features.columns if c != "ICUSTAY_ID"]  # 'c' avoids shadowing pyspark's col()
).show(5, truncate=False)


🧪 Creating laboratory features from LABEVENTS...
📊 Identifying top 20 most frequent lab items...
🎯 Top 20 lab items selected: [51221, 50983, 51301, 51265, 50971, 51248, 51250, 51222, 51249, 51279, 51277, 50912, 51006, 50902, 50882, 50868, 50931, 50960, 51275, 51237]
📊 Filtering LABEVENTS for top 20 items...
📊 Calculating laboratory statistics...
✅ Created 20 lab features for 12 ICU stays
📊 Sample features:
+----------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------------+------------------+-------------+------------------+------------------+------------------+-------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|ICUSTAY_ID|LAB_51221_AVG     |LAB_50983_AVG     |LAB_51301_AVG    |LAB_51265_AVG     |LAB_50971_AVG     |LAB_51248_AVG     |LAB_51250_AVG    |LAB_51222_AVG     |LAB_51249_AVG|LAB_51279_AVG     |LAB_51277_AVG
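
To make the lab time window concrete, the following plain-Python sketch applies the same rule as the `between` filter in the cell above: a measurement is kept when its chart time falls between 6 hours before and 24 hours after ICU admission, inclusive at both ends. The function name `in_lab_window` and the sample timestamps are illustrative assumptions, not part of the pipeline.

```python
from datetime import datetime, timedelta

PRE_ICU = timedelta(hours=6)    # labs drawn shortly before ICU admission are kept
POST_ICU = timedelta(hours=24)  # first 24 hours of the ICU stay

def in_lab_window(charttime, icu_intime):
    """Return True if charttime lies in [icu_intime - 6h, icu_intime + 24h]."""
    return icu_intime - PRE_ICU <= charttime <= icu_intime + POST_ICU

icu_intime = datetime(2150, 1, 1, 12, 0)  # hypothetical admission time
samples = [
    datetime(2150, 1, 1, 5, 59),   # just over 6h before admission
    datetime(2150, 1, 1, 6, 0),    # exactly 6h before admission
    datetime(2150, 1, 2, 12, 0),   # exactly 24h after admission
    datetime(2150, 1, 2, 12, 1),   # just over 24h after admission
]
flags = [in_lab_window(t, icu_intime) for t in samples]
print(flags)
```

Only the two boundary timestamps fall inside the window, which matches the inclusive behaviour of Spark's `between`.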

## Diagnosis ICD

**Purpose**: Extract diagnosis patterns as ML features from ICD-9 codes.

**Process**:
1. **Top 3**: Keep the top 3 diagnoses per admission (`HADM_ID`), ranked by `SEQ_NUM`, so they can later be joined with the other tables.
2. **Encode**: Map each ICD-9 code to its broad diagnosis chapter (e.g. circulatory, respiratory).
3. **Pivot**: Pivot into 3 columns holding the encoded diagnosis chapter.
4. **Handle Missing Values**: Fill NULL entries with -1.


**Features Created**:
- **TOTAL_DIAGNOSES**: Count of all diagnoses (comorbidity indicator)
- **PRIMARY_DIAGNOSIS**: Most significant diagnosis, encoded.
- **SECONDARY_DIAGNOSIS**: Second most significant diagnosis, encoded.
- **TERTIARY_DIAGNOSIS**: Third most significant diagnosis, encoded.

**Result**: One row per admission (`HADM_ID`) with `TOTAL_DIAGNOSES` and the three encoded diagnosis columns; missing slots are set to -1.
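
The steps above can be sketched on toy data in plain Python. This is only an illustration of the logic (rank by `SEQ_NUM`, keep the first three, encode, pivot, fill with -1); the rows and the `toy_chapter` lookup are hypothetical stand-ins for the diagnoses table and the real ICD-9 chapter mapping.

```python
from collections import defaultdict

# Hypothetical rows: (HADM_ID, SEQ_NUM, ICD9_CODE)
rows = [
    (100, 1, "4280"), (100, 2, "486"), (100, 3, "0389"), (100, 4, "V1582"),
    (200, 2, "5849"), (200, 1, "41401"),
]

def toy_chapter(code):
    """Toy stand-in for the chapter encoder (covers only the codes above)."""
    return {"4280": 7, "41401": 7, "486": 8, "0389": 1, "5849": 10}.get(code, 0)

# Group diagnoses by admission
by_admission = defaultdict(list)
for hadm_id, seq_num, code in rows:
    by_admission[hadm_id].append((seq_num, code))

features = {}
for hadm_id, diagnoses in by_admission.items():
    ranked = sorted(diagnoses)[:3]                       # top 3 by SEQ_NUM
    encoded = [toy_chapter(code) for _, code in ranked]  # encode to chapter
    encoded += [-1] * (3 - len(encoded))                 # missing slots become -1
    features[hadm_id] = {
        "TOTAL_DIAGNOSES": len(diagnoses),
        "PRIMARY_DIAGNOSIS": encoded[0],
        "SECONDARY_DIAGNOSIS": encoded[1],
        "TERTIARY_DIAGNOSIS": encoded[2],
    }

print(features)
```

Admission 200 has only two diagnoses, so its tertiary slot ends up as -1, mirroring the `fillna(-1, ...)` step in the Spark pipeline.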

In [208]:
def icd9_to_chapter(code):
    # Convert to string and clean
    code_str = str(code).strip()
    
    # Handle V codes (supplementary classification)
    if code_str.startswith('V'):
        return 18 #'Supplemental'
    
    # Handle E codes (external causes of injury)
    if code_str.startswith('E'):
        return 19 #'External_Injury'
    
    # Extract first 3 digits for numeric codes
    try:
        # Handle codes like '4280' (convert to 428) or '486' (stays 486)
        numeric_part = code_str.split('.')[0]
        code_num = float(numeric_part[:3])
    except ValueError:
        # Non-numeric or empty code (including NULL, which str() turns into 'None')
        return 0 #'Unknown'
    
    # Map to chapters
    if 1 <= code_num <= 139: return 1 #'Infectious'
    elif 140 <= code_num <= 239: return 2 # 'Neoplasms'
    elif 240 <= code_num <= 279: return 3 #'Endocrine'
    elif 280 <= code_num <= 289: return 4 #'Blood'
    elif 290 <= code_num <= 319: return 5 #'Mental'
    elif 320 <= code_num <= 389: return 6 #'Nervous'
    elif 390 <= code_num <= 459: return 7 #'Circulatory'
    elif 460 <= code_num <= 519: return 8 #'Respiratory'
    elif 520 <= code_num <= 579: return 9 #'Digestive'
    elif 580 <= code_num <= 629: return 10 #'Genitourinary'
    elif 630 <= code_num <= 679: return 11 #'Pregnancy'
    elif 680 <= code_num <= 709: return 12 #'Skin'
    elif 710 <= code_num <= 739: return 13 #'Musculoskeletal'
    elif 740 <= code_num <= 759: return 14 #'Congenital'
    elif 760 <= code_num <= 779: return 15 #'Perinatal'
    elif 780 <= code_num <= 799: return 16 #'Ill-defined'
    elif 800 <= code_num <= 999: return 17 #'Injury'
    else: return 20 #'Other' 
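
As a possible alternative to the long `if/elif` chain, the chapter boundaries can be stored in a sorted list and searched with `bisect`. The sketch below is a hypothetical table-driven variant (the name `icd9_chapter_lookup` is not part of the notebook) that is intended to return the same integer codes on well-formed ICD-9 strings; it parses with `int` rather than `float`, so unusual inputs may differ.

```python
from bisect import bisect_left

# Inclusive upper bound of each numeric ICD-9 chapter, in chapter order (1..17).
_UPPER_BOUNDS = [139, 239, 279, 289, 319, 389, 459, 519, 579,
                 629, 679, 709, 739, 759, 779, 799, 999]

def icd9_chapter_lookup(code):
    """Table-driven sketch of the ICD-9 chapter mapping."""
    code_str = str(code).strip()
    if code_str.startswith("V"):
        return 18  # Supplemental
    if code_str.startswith("E"):
        return 19  # External injury
    try:
        code_num = int(code_str.split(".")[0][:3])
    except ValueError:
        return 0   # Unknown
    if not 1 <= code_num <= 999:
        return 20  # Other
    # Index of the first upper bound >= code_num; chapters are 1-indexed.
    return bisect_left(_UPPER_BOUNDS, code_num) + 1

for sample in ["4280", "0389", "486", "V3000", "E8120"]:
    print(sample, icd9_chapter_lookup(sample))
```

Keeping the boundaries in one list makes it easier to check them against the ICD-9 chapter ranges at a glance.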

In [209]:
print("\n🏥 Creating diagnosis features (optimized pipeline)...")
start_time = time.time()  # Time this cell only, not an earlier one

# 1. First filter to only top 3 diagnoses per admission
window_spec = Window.partitionBy("HADM_ID").orderBy("SEQ_NUM")

top_3_filtered = diagnoses_df \
    .withColumn("row_num", row_number().over(window_spec)) \
    .filter(col("row_num") <= 3) \
    .cache()

# 2. Register UDF with Integer return type
icd9_chapter_udf = udf(icd9_to_chapter, IntegerType())

# 3. Encode ONLY the top 3 diagnoses
top_3_encoded = top_3_filtered.withColumn(
    "DISEASE_CHAPTER", 
    icd9_chapter_udf(col("ICD9_CODE"))
)

# 4. Pivot to create columns
diagnosis_features = top_3_encoded \
    .groupBy("HADM_ID") \
    .pivot("row_num", [1, 2, 3]) \
    .agg(first("DISEASE_CHAPTER")) \
    .select(
        "HADM_ID",
        col("1").cast(IntegerType()).alias("PRIMARY_DIAGNOSIS"),
        col("2").cast(IntegerType()).alias("SECONDARY_DIAGNOSIS"),
        col("3").cast(IntegerType()).alias("TERTIARY_DIAGNOSIS")
    ) \
    .join(diagnosis_counts, "HADM_ID", "left")

# 5. Fill NULLs and ensure consistent types
diagnosis_features = diagnosis_features.fillna(-1, subset=[
    "PRIMARY_DIAGNOSIS",
    "SECONDARY_DIAGNOSIS",
    "TERTIARY_DIAGNOSIS"
])


print("📊 Optimized diagnosis features:")
diagnosis_features.select(
    "HADM_ID",
    "TOTAL_DIAGNOSES",
    "PRIMARY_DIAGNOSIS",
    "SECONDARY_DIAGNOSIS",
    "TERTIARY_DIAGNOSIS"
).show(20, truncate=False)

print(f"⏰ Completed in: {time.time() - start_time:.2f}s")


🏥 Creating diagnosis features (optimized pipeline)...
📊 Optimized diagnosis features:


25/06/05 13:58:35 WARN CacheManager: Asked to cache already cached data.

+-------+---------------+-----------------+-------------------+------------------+
|HADM_ID|TOTAL_DIAGNOSES|PRIMARY_DIAGNOSIS|SECONDARY_DIAGNOSIS|TERTIARY_DIAGNOSIS|
+-------+---------------+-----------------+-------------------+------------------+
|152943 |7              |7                |6                  |6                 |
|163177 |7              |9                |8                  |5                 |
|110159 |12             |17               |9                  |8                 |
|109820 |11             |1                |8                  |8                 |
|181763 |12             |7                |17                 |4                 |
|150954 |6              |7                |8                  |18                |
|177309 |16             |1                |12                 |10                |
|110972 |13             |17               |7                  |17                |
|197549 |15             |17               |7                  |17                |
|109

                                                                                

## Joining All Features

In [210]:
print("📊 Joining all features and selecting final features for regression modeling...")

# Define feature columns to exclude
exclude_columns = {"ICUSTAY_ID", "HADM_ID", "SUBJECT_ID", "ICU_INTIME", "ICU_OUTTIME"}

# Join all features once, then drop identifier and timestamp columns
modeling_dataset = base_icu_df \
    .join(vitals_features_top20, "ICUSTAY_ID", "left") \
    .join(labs_features, "ICUSTAY_ID", "left") \
    .join(diagnosis_features, "HADM_ID", "left") \
    .drop(*exclude_columns)

# Cleanup
base_icu_df.unpersist()
vitals_features_top20.unpersist()
labs_features.unpersist()
diagnosis_features.unpersist()

# Display final info
print(f"✅ Final modeling dataset created with {modeling_dataset.count()} records")
print("📋 Sample of final modeling dataset:")
modeling_dataset.show(5, truncate=False)


📊 Joining all features and selecting final features for regression modeling...
✅ Final modeling dataset created with 12 records
📋 Sample of final modeling dataset:


25/06/05 13:58:43 WARN DAGScheduler: Broadcasting large task binary with size 1671.0 KiB


+------------+--------------------+-------------+------------+-------------+----------------------+-----------------+----------------------+----------------+------------------+----------------+-----------------+----------------+------------------+-----------------+-----------------+-----------------+------------------+-------------+------------------+-----------------+-----------------+------------------+----------------+-----------------+------------------+-----------------+----------------+-----------------+------------------+-----------------+------------------+------------------+-------------+------------------+------------------+------------------+-------------+------------------+-------------+------------------+------------------+------------------+-------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-----------------+-------------------+------------------+---------------+
|ICU_LOS_DAYS|

## Normalization & Handling Missing Values

Display Missing Values by column.

In [211]:

null_counts = modeling_dataset.select(
    [spark_sum(col(c).isNull().cast("int")).alias(c) for c in modeling_dataset.columns]
).collect()[0]


null_counts_dict = {c: null_counts[c] for c in modeling_dataset.columns}
print(null_counts_dict)

25/06/05 13:58:49 WARN DAGScheduler: Broadcasting large task binary with size 1774.2 KiB


{'ICU_LOS_DAYS': 0, 'AGE_AT_ICU_ADMISSION': 0, 'GENDER_BINARY': 0, 'CAME_FROM_ER': 0, 'HAS_INSURANCE': 0, 'ADMISSION_TYPE_ENCODED': 0, 'ETHNICITY_ENCODED': 0, 'MARITAL_STATUS_ENCODED': 0, 'RELIGION_ENCODED': 0, 'FIRST_UNIT_ENCODED': 0, 'CHANGED_ICU_UNIT': 0, 'VITAL_220045_AVG': 2, 'VITAL_220277_AVG': 2, 'VITAL_220210_AVG': 2, 'VITAL_220181_AVG': 3, 'VITAL_220179_AVG': 3, 'VITAL_220180_AVG': 3, 'VITAL_211_AVG': 10, 'VITAL_742_AVG': 10, 'VITAL_618_AVG': 10, 'VITAL_646_AVG': 10, 'VITAL_223901_AVG': 2, 'VITAL_220739_AVG': 2, 'VITAL_223900_AVG': 2, 'VITAL_220052_AVG': 7, 'VITAL_220050_AVG': 7, 'VITAL_220051_AVG': 7, 'VITAL_223753_AVG': 7, 'VITAL_8441_AVG': 11, 'VITAL_455_AVG': 11, 'VITAL_456_AVG': 11, 'LAB_51221_AVG': 0, 'LAB_50983_AVG': 0, 'LAB_51301_AVG': 0, 'LAB_51265_AVG': 0, 'LAB_50971_AVG': 0, 'LAB_51248_AVG': 0, 'LAB_51250_AVG': 0, 'LAB_51222_AVG': 0, 'LAB_51249_AVG': 0, 'LAB_51279_AVG': 0, 'LAB_51277_AVG': 0, 'LAB_50912_AVG': 0, 'LAB_51006_AVG': 0, 'LAB_50902_AVG': 0, 'LAB_50882_AVG

We chose min-max scaling over standardization because -1 marks missing values. Min-max scaling maps observed values to [0, 1], so -1 stays outside the valid range and remains distinguishable. Under standardization (zero mean, unit variance), real measurements can land near -1, so -1 could be read as an actual test result rather than a missing one. We applied scaling only to the float (`_AVG`) columns, since the others are binary or integer (e.g. age). Scaled values are rounded to 5 decimal places.
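
A small plain-Python sketch may help illustrate this choice. The helper names `min_max_with_sentinel` and `z_scores` and the sample values are hypothetical; the first mirrors the idea of the Spark cell below (scale observed values, pass the -1 sentinel through), and the second shows what standardization would do to genuine measurements.

```python
def min_max_with_sentinel(values, sentinel=-1.0):
    """Scale observed values to [0, 1]; keep the missing-value sentinel as-is."""
    observed = [v for v in values if v != sentinel]
    if not observed:
        return list(values)  # nothing to scale
    lo, hi = min(observed), max(observed)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [sentinel if v == sentinel else round((v - lo) / span, 5)
            for v in values]

def z_scores(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
    return [round((v - mean) / std, 5) for v in values]

raw = [-1.0, 80.0, 90.0, 100.0]   # hypothetical averages; the first is missing
scaled = min_max_with_sentinel(raw)
standardized = z_scores([v for v in raw if v != -1.0])

print(scaled)        # sentinel stays at -1, observed values fall in [0, 1]
print(standardized)  # genuine values can fall below -1 after standardization
```

With min-max scaling the sentinel is the only value outside [0, 1]; with standardization the lowest genuine value already falls below -1, so the sentinel would no longer be unambiguous.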

In [219]:
print("📊 Filling NULL entries with -1...")
std_columns = [c for c in modeling_dataset.columns if c.endswith('_AVG')]

modeling_dataset = modeling_dataset.na.fill(-1)

print("📊 Computing min-max scaling in _AVG columns, excluding -1 entries...")
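# Single aggregation pass instead of one Spark job per column. when() without
# otherwise() turns the -1 sentinel into NULL, which min/max ignore.
stats = modeling_dataset.agg(
    *[spark_min(when(col(c) != -1.0, col(c))).alias(f"{c}__min") for c in std_columns],
    *[spark_max(when(col(c) != -1.0, col(c))).alias(f"{c}__max") for c in std_columns]
).first()
min_max_values = {c: (stats[f"{c}__min"], stats[f"{c}__max"]) for c in std_columns}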

for col_name in std_columns:
    min_val, max_val = min_max_values[col_name]
    if min_val is None:
        continue  # No observed values in this column; leave it at -1
    range_val = max_val - min_val if max_val != min_val else 1.0
    modeling_dataset = modeling_dataset.withColumn(
        col_name, 
        when(col(col_name) == -1.0, -1.0).otherwise(
            round((col(col_name) - min_val) / range_val, 5)
        )
    )

print("✅ Data set ready for Machine Learning!")
modeling_dataset.show(5, truncate=False)

num_rows = modeling_dataset.count()
num_cols = len(modeling_dataset.columns)
print(f"Final DataSet shape: ({num_rows}, {num_cols})")

📊 Filling NULL entries with -1...
📊 Computing min-max scaling in _AVG columns, excluding -1 entries...
✅ Data set ready for Machine Learning!


25/06/05 14:05:24 WARN DAGScheduler: Broadcasting large task binary with size 1917.2 KiB


+------------+--------------------+-------------+------------+-------------+----------------------+-----------------+----------------------+----------------+------------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+-------------+-------------+-------------+-------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-----------------+-------------------+------------------+---------------+
|ICU_LOS_DAYS|AGE_AT_ICU_ADMISSION|GENDER_BINARY|CAME_FROM_ER|HAS_INSURANCE|ADMISSION_TYPE_ENCODED|ETHNICITY_ENCODED|MARITAL_STATUS_E

# Machine Learning

## Preparing for Machine Learning

In [213]:
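print("📊 Step 1: Creating train/test split...")
# Cache first: without it, each action re-evaluates the upstream joins and
# randomSplit can assign rows differently (note the 10/2 vs 9/3 counts in the output).
modeling_dataset = modeling_dataset.cache()
train_data, test_data = modeling_dataset.randomSplit([0.8, 0.2], seed=13)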

print("✅ Data split completed.")
print(f"   🚆 Training samples: {train_data.count()}")
print(f"   🧪 Test samples: {test_data.count()}")


feature_columns = [c for c in modeling_dataset.columns if c != 'ICU_LOS_DAYS']
print("Feature columns:", feature_columns)
target_column = 'ICU_LOS_DAYS'
print("Target column:", target_column)

feature_assembler = VectorAssembler(
    inputCols=feature_columns,  
    outputCol="features"     
)

print("📊 Step 2: Creating the final vectorized train/test datasets...")
train_final = feature_assembler.transform(train_data).select(
    "features", 
    target_column
).withColumnRenamed(target_column, "label")

test_final = feature_assembler.transform(test_data).select(
    "features", 
    target_column
).withColumnRenamed(target_column, "label")

train_final.cache()
test_final.cache()

print("✅ Final datasets prepared:")
print(f"   🚆 Training features shape: ({train_final.count()}, {len(feature_columns)})")
print(f"   🧪 Test features shape: ({test_final.count()}, {len(feature_columns)})")

📊 Step 1: Creating train/test split...
✅ Data split completed.


25/06/05 13:59:24 WARN DAGScheduler: Broadcasting large task binary with size 1789.7 KiB


   🚆 Training samples: 10


25/06/05 13:59:32 WARN DAGScheduler: Broadcasting large task binary with size 1789.7 KiB


   🧪 Test samples: 2
Feature columns: ['AGE_AT_ICU_ADMISSION', 'GENDER_BINARY', 'CAME_FROM_ER', 'HAS_INSURANCE', 'ADMISSION_TYPE_ENCODED', 'ETHNICITY_ENCODED', 'MARITAL_STATUS_ENCODED', 'RELIGION_ENCODED', 'FIRST_UNIT_ENCODED', 'CHANGED_ICU_UNIT', 'VITAL_220045_AVG', 'VITAL_220277_AVG', 'VITAL_220210_AVG', 'VITAL_220181_AVG', 'VITAL_220179_AVG', 'VITAL_220180_AVG', 'VITAL_211_AVG', 'VITAL_742_AVG', 'VITAL_618_AVG', 'VITAL_646_AVG', 'VITAL_223901_AVG', 'VITAL_220739_AVG', 'VITAL_223900_AVG', 'VITAL_220052_AVG', 'VITAL_220050_AVG', 'VITAL_220051_AVG', 'VITAL_223753_AVG', 'VITAL_8441_AVG', 'VITAL_455_AVG', 'VITAL_456_AVG', 'LAB_51221_AVG', 'LAB_50983_AVG', 'LAB_51301_AVG', 'LAB_51265_AVG', 'LAB_50971_AVG', 'LAB_51248_AVG', 'LAB_51250_AVG', 'LAB_51222_AVG', 'LAB_51249_AVG', 'LAB_51279_AVG', 'LAB_51277_AVG', 'LAB_50912_AVG', 'LAB_51006_AVG', 'LAB_50902_AVG', 'LAB_50882_AVG', 'LAB_50868_AVG', 'LAB_50931_AVG', 'LAB_50960_AVG', 'LAB_51275_AVG', 'LAB_51237_AVG', 'PRIMARY_DIAGNOSIS', 'SECONDARY_

25/06/05 13:59:40 WARN DAGScheduler: Broadcasting large task binary with size 1848.9 KiB
25/06/05 13:59:40 WARN DAGScheduler: Broadcasting large task binary with size 1854.3 KiB
                                                                                

   🚆 Training features shape: (9, 54)


25/06/05 13:59:47 WARN DAGScheduler: Broadcasting large task binary with size 1848.9 KiB
25/06/05 13:59:48 WARN DAGScheduler: Broadcasting large task binary with size 1854.3 KiB


   🧪 Test features shape: (3, 54)


## Training Multiple Models

In [214]:
print("📊 Step 1: Setting up evaluation metrics...")

# Create regression evaluators
rmse_evaluator = RegressionEvaluator(
    labelCol="label", 
    predictionCol="prediction", 
    metricName="rmse"
)

mae_evaluator = RegressionEvaluator(
    labelCol="label",
    predictionCol="prediction", 
    metricName="mae"
)

r2_evaluator = RegressionEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="r2"
)

print("✅ Evaluation metrics configured: RMSE, MAE, R²")

📊 Step 1: Setting up evaluation metrics...
✅ Evaluation metrics configured: RMSE, MAE, R²


### Linear Regression

In [215]:
print("\n📈 Step 2: Training Linear Regression model...")
print(f"🕐 Started at: {datetime.now().strftime('%H:%M:%S')}")
start_time = time.time()

# Create Linear Regression model
lr = LinearRegression(
    featuresCol="features",
    labelCol="label",
    maxIter=200,                    # Upper bound on solver iterations
    regParam=0.001,                 # Weak regularization
    elasticNetParam=0.1,            # Mostly L2 with a small L1 component
    tol=1e-8,                       # Tight convergence tolerance
    standardization=False,          # Features are already scaled manually
    fitIntercept=True,
    aggregationDepth=3,             # Depth of treeAggregate (default is 2)
    loss="squaredError",
    solver="normal"                 # Normal-equation solver; suits a small feature count
)


# Train the model
print("   🔄 Training Linear Regression...")
lr_model = lr.fit(train_final)

print("   🔄 Linear Regression - Making predictions (test data)...")
lr_predictions = lr_model.transform(test_final)

print("   🔄 Linear Regression - Evaluation...")
lr_rmse = rmse_evaluator.evaluate(lr_predictions)
lr_mae = mae_evaluator.evaluate(lr_predictions)
lr_r2 = r2_evaluator.evaluate(lr_predictions)

print(f"✅ Linear Regression Results:")
print(f"   📉 RMSE: {lr_rmse:.3f} days")
print(f"   📊 MAE: {lr_mae:.3f} days")
print(f"   📈 R²: {lr_r2:.3f}")

end_time = time.time()
elapsed_time = end_time - start_time
print(f"🕐 Completed at: {datetime.now().strftime('%H:%M:%S')}")
print(f"⏱️ Total elapsed time: {elapsed_time:.2f} seconds")


📈 Step 2: Training Linear Regression model...
🕐 Started at: 13:59:49
   🔄 Training Linear Regression...


25/06/05 13:59:50 WARN DAGScheduler: Broadcasting large task binary with size 1870.7 KiB

   🔄 Linear Regression - Making predictions (test data)...
   🔄 Linear Regression - Evaluation...


25/06/05 13:59:53 WARN DAGScheduler: Broadcasting large task binary with size 1876.9 KiB


✅ Linear Regression Results:
   📉 RMSE: 5.078 days
   📊 MAE: 3.272 days
   📈 R²: -1.120
🕐 Completed at: 13:59:56
⏱️ Total elapsed time: 7.67 seconds


25/06/05 13:59:56 WARN DAGScheduler: Broadcasting large task binary with size 1878.0 KiB
                                                                                

### Random Forest

In [216]:

print("\n🌲 Step 3: Training Random Forest model...")
print(f"🕐 Started at: {datetime.now().strftime('%H:%M:%S')}")
start_time = time.time()

# Create Random Forest model
rf = RandomForestRegressor(
    featuresCol="features",
    labelCol="label",
    numTrees=200,                   # More trees reduce variance at higher compute cost
    maxDepth=12,                    # Deeper trees capture more complexity but overfit more easily
    minInstancesPerNode=2,          # Each child node must hold at least 2 rows
    subsamplingRate=0.9,            # Fraction of training rows sampled per tree
    featureSubsetStrategy="sqrt",   # sqrt(n) features per split; Spark's "auto" uses "onethird" for regression
    seed=42                         # Reproducibility
)

print("   🔄 Training Random Forest...")
rf_model = rf.fit(train_final)

print("   🔄 Random Forest - Making predictions (test data)...")
rf_predictions = rf_model.transform(test_final)

print("   🔄 Random Forest - Evaluation...")
rf_rmse = rmse_evaluator.evaluate(rf_predictions)
rf_mae = mae_evaluator.evaluate(rf_predictions)
rf_r2 = r2_evaluator.evaluate(rf_predictions)

print(f"✅ Random Forest Results:")
print(f"   📉 RMSE: {rf_rmse:.3f} days")
print(f"   📊 MAE: {rf_mae:.3f} days")
print(f"   📈 R²: {rf_r2:.3f}")


end_time = time.time()
elapsed_time = end_time - start_time
print(f"🕐 Completed at: {datetime.now().strftime('%H:%M:%S')}")
print(f"⏱️ Total elapsed time: {elapsed_time:.2f} seconds")



🌲 Step 3: Training Random Forest model...
🕐 Started at: 13:59:56
   🔄 Training Random Forest...


25/06/05 13:59:57 WARN DAGScheduler: Broadcasting large task binary with size 1868.8 KiB
25/06/05 13:59:58 WARN DecisionTreeMetadata: DecisionTree reducing maxBins from 32 to 9 (= number of training instances)

   🔄 Random Forest - Making predictions (test data)...
   🔄 Random Forest - Evaluation...


25/06/05 14:00:04 WARN DAGScheduler: Broadcasting large task binary with size 1871.4 KiB


✅ Random Forest Results:
   📉 RMSE: 4.207 days
   📊 MAE: 2.719 days
   📈 R²: -0.455
🕐 Completed at: 14:00:07
⏱️ Total elapsed time: 10.57 seconds


25/06/05 14:00:07 WARN DAGScheduler: Broadcasting large task binary with size 1872.5 KiB


## Model Predictions

In [217]:
evaluator_r2 = RegressionEvaluator(metricName="r2")

print("\n📈 Linear Regression Predictions:")
lr_display = lr_predictions.select(
    col("label").alias("Actual_LOS"),
    round(col("prediction"), 3).alias("Predicted_LOS"),
    round(abs(col("label") - col("prediction")), 3).alias("Absolute_Error"),
    round(((abs(col("label") - col("prediction")) / col("label")) * 100), 2).alias("Percent_Error")
)

lr_display.show(truncate=False)



# Random Forest Predictions
print("\n🌲 Random Forest Predictions:")
rf_display = rf_predictions.select(
    col("label").alias("Actual_LOS"),
    round(col("prediction"), 3).alias("Predicted_LOS"),
    round(abs(col("label") - col("prediction")), 3).alias("Absolute_Error"),
    round(((abs(col("label") - col("prediction")) / col("label")) * 100), 2).alias("Percent_Error")
)

rf_display.show(truncate=False)


📈 Linear Regression Predictions:


25/06/05 14:00:08 WARN DAGScheduler: Broadcasting large task binary with size 1868.0 KiB
                                                                                

+----------+-------------+--------------+-------------+
|Actual_LOS|Predicted_LOS|Absolute_Error|Percent_Error|
+----------+-------------+--------------+-------------+
|1.2597    |2.244        |0.985         |78.17        |
|8.9163    |0.176        |8.74          |98.02        |
|1.8064    |1.716        |0.09          |5.01         |
+----------+-------------+--------------+-------------+


🌲 Random Forest Predictions:


25/06/05 14:00:09 WARN DAGScheduler: Broadcasting large task binary with size 1862.4 KiB


+----------+-------------+--------------+-------------+
|Actual_LOS|Predicted_LOS|Absolute_Error|Percent_Error|
+----------+-------------+--------------+-------------+
|1.2597    |1.97         |0.71          |56.39        |
|8.9163    |1.668        |7.249         |81.3         |
|1.8064    |1.608        |0.198         |10.97        |
+----------+-------------+--------------+-------------+
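
The reported metrics can be reproduced by hand from the three test rows printed above. The sketch below recomputes RMSE, MAE and R² in plain Python, using the rounded values from the two prediction tables; the helper name `regression_metrics` is an illustrative assumption. It also shows why R² is negative here: the residual sum of squares exceeds the total sum of squares around the test-set mean, mostly because of the single 8.9-day stay.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """Return (RMSE, MAE, R2) for paired lists of actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual) # total sum of squares
    rmse = sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2

# Values copied from the prediction tables above (rounded to 3 decimals there)
actual = [1.2597, 8.9163, 1.8064]
lr_pred = [2.244, 0.176, 1.716]
rf_pred = [1.97, 1.668, 1.608]

lr_metrics = regression_metrics(actual, lr_pred)
rf_metrics = regression_metrics(actual, rf_pred)
print("Linear Regression: RMSE=%.3f MAE=%.3f R2=%.3f" % lr_metrics)
print("Random Forest:     RMSE=%.3f MAE=%.3f R2=%.3f" % rf_metrics)
```

With only three test rows, these metrics are dominated by one observation, so they say little about how either model would perform on the full cohort.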



## Model Comparison

In [218]:
print("\n🏆 Step 5: Model Performance Comparison...")

# Create comparison summary
results_data = [
    ("Linear Regression", lr_rmse, lr_mae, lr_r2),
    ("Random Forest", rf_rmse, rf_mae, rf_r2)
]

results_df = spark.createDataFrame(results_data, ["Model", "RMSE", "MAE", "R2"])

print("📊 Model Performance Summary:")
results_df.show(truncate=False)


best_rmse_model = builtins.min(results_data, key=operator.itemgetter(1))
best_r2_model = builtins.max(results_data, key=operator.itemgetter(3))

print(f"\n🥇 Best Models:")
print(f"   🎯 Lowest RMSE: {best_rmse_model[0]} ({best_rmse_model[1]:.3f} days)")
print(f"   📈 Highest R²: {best_r2_model[0]} ({best_r2_model[3]:.3f})")


🏆 Step 5: Model Performance Comparison...
📊 Model Performance Summary:
+-----------------+-----------------+------------------+-------------------+
|Model            |RMSE             |MAE               |R2                 |
+-----------------+-----------------+------------------+-------------------+
|Linear Regression|5.078344674303861|3.2717855085856535|-1.1202130831027075|
|Random Forest    |4.206614476189922|2.7190549194444436|-0.4547909365739031|
+-----------------+-----------------+------------------+-------------------+


🥇 Best Models:
   🎯 Lowest RMSE: Random Forest (4.207 days)
   📈 Highest R²: Random Forest (-0.455)
