# ICU Length of Stay Prediction - MIMIC-III Pipeline

## 🎯 Objective
Predict ICU stay duration using PySpark ML on MIMIC-III dataset

## 📊 Data & Constraints
- **Sources**: 6 MIMIC-III tables (CHARTEVENTS, LABEVENTS, ICUSTAYS, etc.)
- **Filters**: 
        - Patient Age 18-80
        - LOS 0.1-15 days
        - Valid time sequences
- **Timeframe**: Vitals (first 24h), Labs (6h pre to 24h post ICU)

## 🔧 Features (39 total)
- **Demographics (2)**: Age, gender
- **Admission (8)**: Emergency/elective, timing, insurance
- **ICU Units (6)**: Care unit types, transfers
- **Vitals (11)**: HR, BP, RR, temp, SpO2 (avg/std)
- **Labs (8)**: Creatinine, glucose, electrolytes, blood counts
- **Diagnoses (4)**: Total count, sepsis, respiratory failure

## 🤖 Models & Results
- **Linear Regression**: 
- **Random Forest**: 

## ☁️ Infrastructure
- **GCP Dataproc**: 6x e2-highmem-4 workers (28 vCPUs, 224GB RAM)
- **Optimizations**: Smart sampling, aggressive filtering, 80/20 split





## Import Libraries

In [1]:
# Core PySpark imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

# Machine Learning imports
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer
from pyspark.ml.regression import RandomForestRegressor, LinearRegression #, GBTRegressor
#from pyspark.ml.classification import RandomForestClassifier, LogisticRegression
from pyspark.ml.evaluation import RegressionEvaluator #, MulticlassClassificationEvaluator
#from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
#from pyspark.ml import Pipeline

from datetime import datetime, timedelta
import time

print("✅ All imports loaded successfully!")
print(f"⏰ Notebook started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ All imports loaded successfully!
⏰ Notebook started at: 2025-06-01 11:06:06


## Setup Spark Session

In [2]:
spark = SparkSession.builder \
        .appName("Forecast-LOS") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
        .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
        .config("spark.sql.shuffle.partitions", "400") \
        .config("spark.sql.adaptive.skewJoin.enabled", "true") \
        .config("spark.sql.adaptive.localShuffleReader.enabled", "true") \
        .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256MB") \
        .config("spark.sql.adaptive.coalescePartitions.minPartitionSize", "128MB") \
        .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB") \
        .config("spark.sql.files.maxPartitionBytes", "128MB") \
        .config("spark.network.timeout", "800s") \
        .config("spark.executor.heartbeatInterval", "60s") \
        .config("spark.executor.memory", "24g") \
        .config("spark.executor.cores", "4") \
        .config("spark.executor.instances", "12") \
        .config("spark.driver.memory", "8g") \
        .getOrCreate()

print("✅ Spark session created successfully!")
print(f"📊 Spark Version: {spark.version}")
print(f"🔧 Application Name: {spark.sparkContext.appName}")
print(f"💾 Available cores: {spark.sparkContext.defaultParallelism}")
print(f"\n⏰ Spark session initialised at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/01 11:06:09 INFO SparkEnv: Registering MapOutputTracker
25/06/01 11:06:10 INFO SparkEnv: Registering BlockManagerMaster
25/06/01 11:06:10 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/06/01 11:06:10 INFO SparkEnv: Registering OutputCommitCoordinator


✅ Spark session created successfully!
📊 Spark Version: 3.5.3
🔧 Application Name: Forecast-LOS
💾 Available cores: 2

⏰ Spark session initialised at: 2025-06-01 11:06:16


# Load Data

In [3]:
DEBUG = True

if DEBUG:
    MIMIC_PATH = "gs://dataproc-staging-europe-west4-719881989993-sa4vn92s/mimic-short"
else:
    MIMIC_PATH = "gs://dataproc-staging-europe-west4-719881989993-sa4vn92s/mimic-data"



print("🏥 Loading MIMIC-III CSV files...")


print("📂 Loading CHARTEVENTS table... [GZIP COMPRESSED]")
if DEBUG:
    chartevents_df = spark.read.option("header", "true").option("inferSchema", "false").csv(f"{MIMIC_PATH}/CHARTEVENTS.csv")
else:
    chartevents_df = spark.read.option("header", "true").option("inferSchema", "false").csv(f"{MIMIC_PATH}/CHARTEVENTS.csv.gz")




print("📂 Loading LABEVENTS table... [GZIP COMPRESSED]")
if DEBUG:
    labevents_df = spark.read.option("header", "true").option("inferSchema", "false").csv(f"{MIMIC_PATH}/LABEVENTS.csv")
else:
    labevents_df = spark.read.option("header", "true").option("inferSchema", "false").csv(f"{MIMIC_PATH}/LABEVENTS.csv.gz")


#print("📂 Loading INPUTEVENTS_MV table... [GZIP COMPRESSED]")
#inputevents_df = spark.read.option("header", "true") .option("inferSchema", "false") .csv(f"{MIMIC_PATH}/INPUTEVENTS_MV.csv.gz")



####

print("📂 Loading ICUSTAYS table...")
icustays_df = spark.read.option("header", "true") .option("inferSchema", "true") .csv(f"{MIMIC_PATH}/ICUSTAYS.csv")

print("📂 Loading PATIENTS table...")
patients_df = spark.read.option("header", "true") .option("inferSchema", "true") .csv(f"{MIMIC_PATH}/PATIENTS.csv")

print("📂 Loading ADMISSIONS table...")
admissions_df = spark.read.option("header", "true") .option("inferSchema", "true") .csv(f"{MIMIC_PATH}/ADMISSIONS.csv")

print("📂 Loading DIAGNOSES_ICD table...")
diagnoses_df = spark.read.option("header", "true") .option("inferSchema", "true") .csv(f"{MIMIC_PATH}/DIAGNOSES_ICD.csv")



# Display basic information about loaded tables
print("\n✅ Tables loaded successfully!")
#print(f"📊 ICUSTAYS: {icustays_df.count():,} rows × {len(icustays_df.columns)} columns")
#print(f"📊 PATIENTS: {patients_df.count():,} rows × {len(patients_df.columns)} columns") 
#print(f"📊 ADMISSIONS: {admissions_df.count():,} rows × {len(admissions_df.columns)} columns")
#print(f"📊 CHARTEVENTS: {chartevents_df.count():,} rows × {len(chartevents_df.columns)} columns")
#print(f"📊 LABEVENTS: {labevents_df.count():,} rows × {len(labevents_df.columns)} columns")
#print(f"📊 INPUTEVENTS_MV: {inputevents_df.count():,} rows × {len(inputevents_df.columns)} columns")



print(f"\n⏰ Data loaded at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

🏥 Loading MIMIC-III CSV files...
📂 Loading CHARTEVENTS table... [GZIP COMPRESSED]


                                                                                

📂 Loading LABEVENTS table... [GZIP COMPRESSED]


                                                                                

📂 Loading ICUSTAYS table...


                                                                                

📂 Loading PATIENTS table...


                                                                                

📂 Loading ADMISSIONS table...


                                                                                

📂 Loading DIAGNOSES_ICD table...


[Stage 9:>                                                          (0 + 1) / 1]


✅ Tables loaded successfully!

⏰ Data loaded at: 2025-06-01 11:06:48


                                                                                

## Features Engineering

Current features for regression:

- Demographics (age, gender)
- Admission characteristics (emergency vs elective, timing)
- ICU unit types and transfers
- Time-based features (weekend, night admissions)
- Medical data


## Extracting Data From ICUSTAYS

In [4]:
print("📊 Step 1: Creating base ICU dataset with patient demographics...")

base_icu_df = icustays_df.alias("icu") \
    .join(patients_df.alias("pat"), "SUBJECT_ID", "inner") \
    .join(admissions_df.alias("adm"), ["SUBJECT_ID", "HADM_ID"], "inner") \
    .select(
        # ICU stay identifiers
        col("icu.ICUSTAY_ID"),
        col("icu.SUBJECT_ID"), 
        col("icu.HADM_ID"),
        
        # Target variable - Length of Stay in ICU (days)
        col("icu.LOS").alias("ICU_LOS_DAYS"),
        
        # ICU characteristics
        col("icu.FIRST_CAREUNIT"),
        col("icu.LAST_CAREUNIT"), 
        col("icu.INTIME").alias("ICU_INTIME"),
        col("icu.OUTTIME").alias("ICU_OUTTIME"),
        
        # Patient demographics
        col("pat.GENDER"),
        col("pat.DOB"),
        col("pat.EXPIRE_FLAG").alias("PATIENT_DIED"),
        
        # Admission details
        col("adm.ADMITTIME"),
        col("adm.DISCHTIME"), 
        col("adm.ADMISSION_TYPE"),
        col("adm.ADMISSION_LOCATION"),
        col("adm.INSURANCE"),
        col("adm.ETHNICITY"),
        col("adm.HOSPITAL_EXPIRE_FLAG").alias("HOSPITAL_DEATH"),
        col("adm.DIAGNOSIS").alias("ADMISSION_DIAGNOSIS")
    )

# Calculate age at ICU admission
base_icu_df = base_icu_df.withColumn("AGE_AT_ICU_ADMISSION", \
                                     floor(datediff(col("ICU_INTIME"), col("DOB")) / 365.25)) \
                                     .filter(col("AGE_AT_ICU_ADMISSION").between(18,80))

📊 Step 1: Creating base ICU dataset with patient demographics...


## Extracting Categorical Features

In [5]:
print("📊 Step 2: Engineering categorical features...")

base_icu_df = base_icu_df \
    .withColumn("GENDER_BINARY", when(col("GENDER") == "M", 1).otherwise(0)) \
    .withColumn("IS_EMERGENCY_ADMISSION", 
                when(col("ADMISSION_TYPE") == "EMERGENCY", 1).otherwise(0)) \
    .withColumn("IS_ELECTIVE_ADMISSION", 
                when(col("ADMISSION_TYPE") == "ELECTIVE", 1).otherwise(0)) \
    .withColumn("CAME_FROM_ER", 
                when(col("ADMISSION_LOCATION").contains("EMERGENCY"), 1).otherwise(0)) \
    .withColumn("HAS_MEDICARE", 
                when(col("INSURANCE") == "Medicare", 1).otherwise(0)) \
    .withColumn("IS_WHITE_ETHNICITY", 
                when(col("ETHNICITY").contains("WHITE"), 1).otherwise(0))

📊 Step 2: Engineering categorical features...


## Extracting ICU Unit Types

In [6]:
print("📊 Step 3: Creating ICU unit type features...")

base_icu_df = base_icu_df \
    .withColumn("FIRST_UNIT_MICU", 
                when(col("FIRST_CAREUNIT") == "MICU", 1).otherwise(0)) \
    .withColumn("FIRST_UNIT_SICU", 
                when(col("FIRST_CAREUNIT") == "SICU", 1).otherwise(0)) \
    .withColumn("FIRST_UNIT_CSRU", 
                when(col("FIRST_CAREUNIT") == "CSRU", 1).otherwise(0)) \
    .withColumn("FIRST_UNIT_CCU", 
                when(col("FIRST_CAREUNIT") == "CCU", 1).otherwise(0)) \
    .withColumn("FIRST_UNIT_TSICU", 
                when(col("FIRST_CAREUNIT") == "TSICU", 1).otherwise(0)) \
    .withColumn("CHANGED_ICU_UNIT", 
                when(col("FIRST_CAREUNIT") != col("LAST_CAREUNIT"), 1).otherwise(0))

📊 Step 3: Creating ICU unit type features...


## Extracting Time-based Features

In [7]:
print("📊 Step 4: Creating time-based features...")
base_icu_df = base_icu_df \
    .withColumn("ADMISSION_TO_ICU_HOURS", 
                (unix_timestamp("ICU_INTIME") - unix_timestamp("ADMITTIME")) / 3600) \
    .withColumn("ICU_LOS_HOURS", col("ICU_LOS_DAYS") * 24) \
    .withColumn("WEEKEND_ADMISSION", 
                when(dayofweek("ICU_INTIME").isin([1, 7]), 1).otherwise(0)) \
    .withColumn("NIGHT_ADMISSION", 
                when(hour("ICU_INTIME").between(20, 7), 1).otherwise(0)) \
    .filter(col("ICU_INTIME") < col("ICU_OUTTIME")) \
    .filter(col("ADMITTIME") <= col("ICU_INTIME")) \
    .filter(col("ICU_LOS_DAYS") > 0.04)

📊 Step 4: Creating time-based features...


## Remove Outliers (Excessive Length Of Stay)

In [8]:
print("📊 Step 5: Cleaning target variable...")

base_icu_df = base_icu_df.filter(col("ICU_LOS_DAYS").between(0.1, 15))

base_icu_df.cache()

📊 Step 5: Cleaning target variable...


25/06/01 11:06:49 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


DataFrame[ICUSTAY_ID: int, SUBJECT_ID: int, HADM_ID: int, ICU_LOS_DAYS: double, FIRST_CAREUNIT: string, LAST_CAREUNIT: string, ICU_INTIME: timestamp, ICU_OUTTIME: timestamp, GENDER: string, DOB: timestamp, PATIENT_DIED: int, ADMITTIME: timestamp, DISCHTIME: timestamp, ADMISSION_TYPE: string, ADMISSION_LOCATION: string, INSURANCE: string, ETHNICITY: string, HOSPITAL_DEATH: int, ADMISSION_DIAGNOSIS: string, AGE_AT_ICU_ADMISSION: bigint, GENDER_BINARY: int, IS_EMERGENCY_ADMISSION: int, IS_ELECTIVE_ADMISSION: int, CAME_FROM_ER: int, HAS_MEDICARE: int, IS_WHITE_ETHNICITY: int, FIRST_UNIT_MICU: int, FIRST_UNIT_SICU: int, FIRST_UNIT_CSRU: int, FIRST_UNIT_CCU: int, FIRST_UNIT_TSICU: int, CHANGED_ICU_UNIT: int, ADMISSION_TO_ICU_HOURS: double, ICU_LOS_HOURS: double, WEEKEND_ADMISSION: int, NIGHT_ADMISSION: int]

## Show Dataset Info

In [9]:
print("✅ Master ICU dataset created!")
print(f"📏 Dataset size: {base_icu_df.count():,} ICU stays")
print(f"📊 Features created: {len(base_icu_df.columns)} columns")

# Display sample of the dataset
print("\n📋 Sample of regression features:")
base_icu_df.select(
    "ICUSTAY_ID", "AGE_AT_ICU_ADMISSION", "GENDER_BINARY", "ICU_LOS_DAYS", 
    "FIRST_CAREUNIT", "IS_EMERGENCY_ADMISSION", "ADMISSION_TO_ICU_HOURS"
).show(5)

# Show basic statistics of target variable
print("\n📈 ICU Length of Stay Statistics:")
base_icu_df.select("ICU_LOS_DAYS").describe().show()

print(f"\n⏰ Feature engineering completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Master ICU dataset created!


                                                                                

📏 Dataset size: 16 ICU stays
📊 Features created: 36 columns

📋 Sample of regression features:
+----------+--------------------+-------------+------------+--------------+----------------------+----------------------+
|ICUSTAY_ID|AGE_AT_ICU_ADMISSION|GENDER_BINARY|ICU_LOS_DAYS|FIRST_CAREUNIT|IS_EMERGENCY_ADMISSION|ADMISSION_TO_ICU_HOURS|
+----------+--------------------+-------------+------------+--------------+----------------------+----------------------+
|    255819|                  56|            0|       0.758|          MICU|                     1|     4.977777777777778|
|    231977|                  30|            0|      0.9792|          MICU|                     1|     69.51611111111112|
|    264061|                  51|            1|      1.0576|          CSRU|                     1|  0.016944444444444446|
|    248205|                  47|            0|        4.05|          MICU|                     1|   0.04055555555555555|
|    279243|                  29|            1|     

## Extracting Clinical Features

In [10]:
# Key vital signs ITEMID mappings (common across MIMIC-III)
vital_signs_items = {
    220045: "HEART_RATE",      # Heart Rate
    220050: "SBP",             # Systolic BP  
    220051: "DBP",             # Diastolic BP
    220210: "RESP_RATE",       # Respiratory Rate
    223762: "TEMPERATURE",     # Temperature Celsius
    220277: "SPO2"             # Oxygen Saturation
}

# Get ICU IDs (using existing approach)
icu_ids_list = [row["ICUSTAY_ID"] for row in base_icu_df.select("ICUSTAY_ID").collect()]
print(f"🎯 Processing {len(icu_ids_list)} ICU stays for {len(vital_signs_items)} vital signs")

vital_items_list = list(vital_signs_items.keys())



#chartevents_sample = chartevents_df.sample(0.5, seed=42)  # 50% of 33GB
#print("✅ Created CHARTEVENTS sample for processing")



print("📊 Filtering CHARTEVENTS...")
chartevents_prefiltered = chartevents_df \
    .filter(col("ITEMID").isin(vital_items_list)) \
    .filter(col("VALUENUM").isNotNull()) \
    .filter(col("VALUENUM").between(1, 500)) \
    .filter(col("ICUSTAY_ID").isin(icu_ids_list)) \
    .filter(col("CHARTTIME").isNotNull()) \
    .join(base_icu_df.select("ICUSTAY_ID", "ICU_INTIME", "ICU_OUTTIME"), "ICUSTAY_ID", "inner") \
    .filter(col("CHARTTIME").between(col("ICU_INTIME"), col("ICU_OUTTIME"))) \
    .select("ICUSTAY_ID", "ITEMID", "CHARTTIME", "VALUENUM") \
    .repartition(50, "ICUSTAY_ID")

# This should complete quickly now!
#filtered_count = chartevents_prefiltered.count()
#print(f"✅ Filtered CHARTEVENTS sample: {filtered_count:,} rows")

# Continue with your processing pipeline
print("📊 Processing vital signs within first 24 hours...")

vitals_24h = chartevents_prefiltered.alias("ce") \
    .join(base_icu_df.select("ICUSTAY_ID", "ICU_INTIME"), "ICUSTAY_ID", "inner") \
    .filter(
        col("ce.CHARTTIME").between(
            col("ICU_INTIME"), 
            col("ICU_INTIME") + expr("INTERVAL 24 HOURS")
        )
    )

print("✅ 24-hour vital signs data ready")



vital_signs_items = {
    220045: "HEART_RATE",
    220050: "SBP", 
    220051: "DBP",
    220210: "RESP_RATE",
    223762: "TEMPERATURE",
    220277: "SPO2"
}

# Start with base ICU dataframe
vitals_features = base_icu_df.select("ICUSTAY_ID")

# Add each vital sign as separate joins (faster than pivot)
for itemid, name in vital_signs_items.items():
    print(f"   📊 Processing {name}...")
    
    vital_stats = vitals_24h \
        .filter(col("ITEMID") == itemid) \
        .groupBy("ICUSTAY_ID") \
        .agg(
            avg("VALUENUM").alias(f"{name}_AVG"),
            min("VALUENUM").alias(f"{name}_MIN"),
            max("VALUENUM").alias(f"{name}_MAX"),
            stddev("VALUENUM").alias(f"{name}_STD"),
            count("VALUENUM").alias(f"{name}_COUNT")
        )
    
    # Left join to maintain all ICU stays
    vitals_features = vitals_features.join(vital_stats, "ICUSTAY_ID", "left")

    
chartevents_df.unpersist()
vitals_24h.unpersist()

#print("📊 Counting final features...")
#feature_count = vitals_features.count()
print(f"✅ Vital signs features created for vital_features_count ICU stays")

# Show sample of features
#print("📊 Sample features:")
#vitals_features.show(5)



🎯 Processing 16 ICU stays for 6 vital signs
📊 Filtering CHARTEVENTS...
📊 Processing vital signs within first 24 hours...
✅ 24-hour vital signs data ready
   📊 Processing HEART_RATE...
   📊 Processing SBP...
   📊 Processing DBP...
   📊 Processing RESP_RATE...
   📊 Processing TEMPERATURE...
   📊 Processing SPO2...
✅ Vital signs features created for vital_features_count ICU stays


In [11]:
print("\n🧪 Step 2: Creating laboratory features from LABEVENTS...")

# Key lab test ITEMID mappings
lab_items = {
    50912: "CREATININE",       # Creatinine
    50902: "CHLORIDE",         # Chloride
    50931: "GLUCOSE",          # Glucose
    50983: "SODIUM",           # Sodium
    50971: "POTASSIUM",        # Potassium
    51222: "HEMOGLOBIN",       # Hemoglobin
    51265: "PLATELET",         # Platelet Count
    51301: "WBC",              # White Blood Cells
    50820: "PH"                # pH
}

# Filter lab events within first 24 hours of ICU stay
labs_24h = labevents_df.alias("le") \
    .join(base_icu_df.select("ICUSTAY_ID", "HADM_ID", "ICU_INTIME"), "HADM_ID", "inner") \
    .filter(col("le.ITEMID").isin(list(lab_items.keys()))) \
    .filter(col("le.VALUENUM").isNotNull()) \
    .filter(col("le.VALUENUM") > 0) \
    .filter(
        col("le.CHARTTIME").between(
            col("ICU_INTIME") - expr("INTERVAL 6 HOURS"),  # Include pre-ICU labs
            col("ICU_INTIME") + expr("INTERVAL 24 HOURS")
        )
    )

# Calculate lab value statistics
print("   📊 Calculating laboratory statistics (first 24h)...")

labs_stats = labs_24h.groupBy("ICUSTAY_ID", "ITEMID") \
    .agg(
        avg("VALUENUM").alias("avg_value"),
        min("VALUENUM").alias("min_value"),
        max("VALUENUM").alias("max_value"),
        first("VALUENUM").alias("first_value")  # First available value
    )



# Pivot lab results
labs_features = labs_stats.groupBy("ICUSTAY_ID").pivot("ITEMID").agg(
    first("avg_value").alias("avg"),
    first("first_value").alias("first")
)


# Rename lab columns
for itemid, name in lab_items.items():
    labs_features = labs_features \
        .withColumnRenamed(f"{itemid}_avg", f"{name}_AVG") \
        .withColumnRenamed(f"{itemid}_first", f"{name}_FIRST")

print(f"   ✅ Laboratory features created for {labs_features.count():,} ICU stays")


🧪 Step 2: Creating laboratory features from LABEVENTS...
   📊 Calculating laboratory statistics (first 24h)...


[Stage 31:>                                                         (0 + 1) / 1]

   ✅ Laboratory features created for 16 ICU stays


                                                                                

In [12]:
print("\n🏥 Step 3: Creating diagnosis features from ICD codes...")

# Count number of diagnoses per admission (comorbidity burden)
diagnosis_counts = diagnoses_df.groupBy("HADM_ID") \
    .agg(
        count("ICD9_CODE").alias("TOTAL_DIAGNOSES"),
        collect_list("ICD9_CODE").alias("DIAGNOSIS_CODES")
    )

# Create features for common diagnosis categories
diagnosis_features = diagnosis_counts \
    .withColumn("HAS_SEPSIS", 
                when(array_contains(col("DIAGNOSIS_CODES"), "99591") | 
                     array_contains(col("DIAGNOSIS_CODES"), "99592"), 1).otherwise(0)) \
    .withColumn("HAS_RESPIRATORY_FAILURE",
                when(array_contains(col("DIAGNOSIS_CODES"), "51881") |
                     array_contains(col("DIAGNOSIS_CODES"), "51882"), 1).otherwise(0)) \
    .withColumn("HAS_CARDIAC_ARREST",
                when(array_contains(col("DIAGNOSIS_CODES"), "4275"), 1).otherwise(0)) \
    .drop("DIAGNOSIS_CODES")



#diagnosis_counts.unpersist()


print(f"   ✅ Diagnosis features created for {diagnosis_features.count():,} admissions")

print(f"\n⏰ Clinical features completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


🏥 Step 3: Creating diagnosis features from ICD codes...


[Stage 37:>                                                         (0 + 1) / 1]

   ✅ Diagnosis features created for 40 admissions

⏰ Clinical features completed at: 2025-06-01 11:07:05


                                                                                

# Joining All Features

In [13]:
print("📊 Step 1: Joining base features with clinical data...")

# Start with base ICU dataset
final_dataset = base_icu_df



print("   🫀 Adding vital signs features...")
final_dataset = final_dataset.join(vitals_features, "ICUSTAY_ID", "left")

print("   🧪 Adding laboratory features...")
final_dataset = final_dataset.join(labs_features, "ICUSTAY_ID", "left")

print("   🏥 Adding diagnosis features...")
final_dataset = final_dataset.join(diagnosis_features, "HADM_ID", "left")

print(f"✅ All features joined! Final Dataset")

base_icu_df.unpersist()
vitals_features.unpersist()
labs_features.unpersist()
diagnosis_features.unpersist()


print(f"✅ Cleared from memory and disk")

# ============================================================================
# HANDLE MISSING VALUES
# ============================================================================

print("\n🔧 Step 2: Handling missing values...")

# Fill missing diagnosis counts with 0
final_dataset = final_dataset.fillna({
    "TOTAL_DIAGNOSES": 0,
    "HAS_SEPSIS": 0, 
    "HAS_RESPIRATORY_FAILURE": 0,
    "HAS_CARDIAC_ARREST": 0
})

# Fill missing vital signs with population medians (approximate values)
vital_defaults = {
    "HEART_RATE_AVG": 80, "HEART_RATE_MIN": 65, "HEART_RATE_MAX": 100, "HEART_RATE_STD": 15,
    "SBP_AVG": 120, "SBP_MIN": 100, "SBP_MAX": 140, "SBP_STD": 20,
    "DBP_AVG": 70, "DBP_MIN": 55, "DBP_MAX": 85, "DBP_STD": 15,
    "RESP_RATE_AVG": 18, "RESP_RATE_MIN": 12, "RESP_RATE_MAX": 24, "RESP_RATE_STD": 6,
    "TEMPERATURE_AVG": 37.0, "TEMPERATURE_MIN": 36.5, "TEMPERATURE_MAX": 37.5, "TEMPERATURE_STD": 0.5,
    "SPO2_AVG": 97, "SPO2_MIN": 95, "SPO2_MAX": 99, "SPO2_STD": 2
}

final_dataset = final_dataset.fillna(vital_defaults)

# Fill missing lab values with population medians
lab_defaults = {
    "CREATININE_AVG": 1.0, "CREATININE_FIRST": 1.0,
    "CHLORIDE_AVG": 102, "CHLORIDE_FIRST": 102,
    "GLUCOSE_AVG": 120, "GLUCOSE_FIRST": 120,
    "SODIUM_AVG": 140, "SODIUM_FIRST": 140,
    "POTASSIUM_AVG": 4.0, "POTASSIUM_FIRST": 4.0,
    "HEMOGLOBIN_AVG": 11.0, "HEMOGLOBIN_FIRST": 11.0,
    "PLATELET_AVG": 250, "PLATELET_FIRST": 250,
    "WBC_AVG": 8.5, "WBC_FIRST": 8.5,
    "PH_AVG": 7.4, "PH_FIRST": 7.4
}

final_dataset = final_dataset.fillna(lab_defaults)

# Fill remaining missing values with 0
final_dataset = final_dataset.fillna(0)


print("✅ Missing values handled")

print("🔧 Fixing data types for ML...")

# Cast problematic string columns to double
from pyspark.sql.functions import col

string_columns = [
    "CREATININE_FIRST", "GLUCOSE_FIRST", "SODIUM_FIRST", "POTASSIUM_FIRST",
    "HEMOGLOBIN_FIRST", "PLATELET_FIRST", "WBC_FIRST", "PH_FIRST"
]

for col_name in string_columns:
    if col_name in final_dataset.columns:
        final_dataset = final_dataset.withColumn(
            col_name, 
            col(col_name).cast("double")
        )

# Fill any nulls created during conversion
final_dataset = final_dataset.fillna({
    "CREATININE_FIRST": 1.0,
    "GLUCOSE_FIRST": 120.0,
    "SODIUM_FIRST": 140.0,
    "POTASSIUM_FIRST": 4.0,
    "HEMOGLOBIN_FIRST": 11.0,
    "PLATELET_FIRST": 250.0,
    "WBC_FIRST": 8.5,
    "PH_FIRST": 7.4
})

print("✅ Data types fixed!")


📊 Step 1: Joining base features with clinical data...
   🫀 Adding vital signs features...
   🧪 Adding laboratory features...
   🏥 Adding diagnosis features...
✅ All features joined! Final Dataset
✅ Cleared from memory and disk

🔧 Step 2: Handling missing values...
✅ Missing values handled
🔧 Fixing data types for ML...
✅ Data types fixed!


In [14]:
print("\n📋 Step 3: Selecting final features for regression modeling...")

# Define feature columns for modeling
feature_columns = [
    # Demographics
    "AGE_AT_ICU_ADMISSION", "GENDER_BINARY",
    
    # Admission characteristics
    "IS_EMERGENCY_ADMISSION", "IS_ELECTIVE_ADMISSION", "CAME_FROM_ER",
    "HAS_MEDICARE", "IS_WHITE_ETHNICITY", "ADMISSION_TO_ICU_HOURS",
    "WEEKEND_ADMISSION", "NIGHT_ADMISSION",
    
    # ICU unit features
    "FIRST_UNIT_MICU", "FIRST_UNIT_SICU", "FIRST_UNIT_CSRU", 
    "FIRST_UNIT_CCU", "FIRST_UNIT_TSICU", "CHANGED_ICU_UNIT",
    
    # Vital signs (averages)
    "HEART_RATE_AVG", "SBP_AVG", "DBP_AVG", "RESP_RATE_AVG", 
    "TEMPERATURE_AVG", "SPO2_AVG",
    
    # Vital signs (variability)
    "HEART_RATE_STD", "SBP_STD", "DBP_STD", "RESP_RATE_STD", "SPO2_STD",
    
    # Laboratory values
    "CREATININE_FIRST", "GLUCOSE_FIRST", "SODIUM_FIRST", "POTASSIUM_FIRST",
    "HEMOGLOBIN_FIRST", "PLATELET_FIRST", "WBC_FIRST", "PH_FIRST",
    
    # Diagnosis features
    "TOTAL_DIAGNOSES", "HAS_SEPSIS", "HAS_RESPIRATORY_FAILURE", "HAS_CARDIAC_ARREST"
]

# Create modeling dataset with selected features
modeling_dataset = final_dataset.select(
    ["ICUSTAY_ID", "ICU_LOS_DAYS"] + feature_columns
)

# Remove any remaining nulls and invalid records
modeling_dataset = modeling_dataset.filter(col("ICU_LOS_DAYS").isNotNull()) \
    .filter(col("ICU_LOS_DAYS") > 0) \
    .filter(col("AGE_AT_ICU_ADMISSION").between(18,80))

# Cache the final dataset
#modeling_dataset = modeling_dataset.repartition(200)
#modeling_dataset.cache()


print(f"✅ Final modeling dataset prepared!")
#print(f"📏 Final dataset: {modeling_dataset.count():,} ICU stays")
print(f"📊 Total features: {len(feature_columns)} predictive features")
print(f"🎯 Target variable: ICU_LOS_DAYS (continuous)")

# Show feature summary
print(f"\n📋 Feature categories:")
print(f"   👤 Demographics: 2 features")
print(f"   🏥 Admission: 8 features") 
print(f"   🏢 ICU Unit: 6 features")
print(f"   🫀 Vital Signs: 11 features")
print(f"   🧪 Laboratory: 8 features")
print(f"   🩺 Diagnoses: 4 features")

# Display sample of final dataset
#print(f"\n📋 Sample of final modeling dataset:")
#modeling_dataset.select("ICUSTAY_ID", "ICU_LOS_DAYS", "AGE_AT_ICU_ADMISSION", 
#                       "HEART_RATE_AVG", "CREATININE_FIRST", "HAS_SEPSIS").show(5)

# Basic statistics of target variable
#print(f"\n📈 Final ICU Length of Stay Statistics:")
#modeling_dataset.select("ICU_LOS_DAYS").describe().show()

print(f"\n⏰ Dataset preparation completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"🚀 Ready for train/test split and model training!")


📋 Step 3: Selecting final features for regression modeling...
✅ Final modeling dataset prepared!
📊 Total features: 39 predictive features
🎯 Target variable: ICU_LOS_DAYS (continuous)

📋 Feature categories:
   👤 Demographics: 2 features
   🏥 Admission: 8 features
   🏢 ICU Unit: 6 features
   🫀 Vital Signs: 11 features
   🧪 Laboratory: 8 features
   🩺 Diagnoses: 4 features

⏰ Dataset preparation completed at: 2025-06-01 11:07:07
🚀 Ready for train/test split and model training!


## Preparing for Machine Learning

In [15]:
print("📊 Step 1: Creating train/test split...")

# Split the data (80% train, 20% test)
train_data, test_data = modeling_dataset.randomSplit([0.8, 0.2], seed=42)

# Cache both datasets for performance
train_data.cache()
test_data.cache()


modeling_dataset.unpersist()

print(f"✅ Data split completed:")
#print(f"   📈 Training set: {train_data.count():,} ICU stays ({train_data.count()/modeling_dataset.count()*100:.1f}%)")
#print(f"   📊 Test set: {test_data.count():,} ICU stays ({test_data.count()/modeling_dataset.count()*100:.1f}%)")

# Show target variable distribution in both sets
#print(f"\n📈 Target variable distribution:")
#print(f"Training set LOS statistics:")
#train_data.select("ICU_LOS_DAYS").describe().show()

#print(f"Test set LOS statistics:")
#test_data.select("ICU_LOS_DAYS").describe().show()

# ============================================================================
# FEATURE VECTOR ASSEMBLY
# ============================================================================

print("\n🔧 Step 2: Assembling feature vectors...")

# Create feature vector assembler
feature_assembler = VectorAssembler(
    inputCols=feature_columns,
    outputCol="features_raw"
)

# Apply feature assembler to training data
train_assembled = feature_assembler.transform(train_data)
test_assembled = feature_assembler.transform(test_data)

print(f"✅ Feature vectors assembled:")
print(f"   📊 Feature vector size: {len(feature_columns)} dimensions")

# ============================================================================
# FEATURE SCALING
# ============================================================================
'''
print("\n⚖️ Step 3: Scaling features...")

# Create StandardScaler to normalize features
scaler = StandardScaler(
    inputCol="features_raw",
    outputCol="features",
    withStd=True,
    withMean=True
)

# Fit scaler on training data
scaler_model = scaler.fit(train_assembled)
train_scaled = scaler_model.transform(train_assembled)
test_scaled = scaler_model.transform(test_assembled)

# Cache the final processed datasets
train_scaled.cache()
test_scaled.cache()


print(f"✅ Feature scaling completed:")
print(f"   📊 Features standardized (mean=0, std=1)")
print(f"   🔧 Scaler fitted on training data only")
'''

📊 Step 1: Creating train/test split...
✅ Data split completed:

🔧 Step 2: Assembling feature vectors...
✅ Feature vectors assembled:
   📊 Feature vector size: 39 dimensions


'\nprint("\n⚖️ Step 3: Scaling features...")\n\n# Create StandardScaler to normalize features\nscaler = StandardScaler(\n    inputCol="features_raw",\n    outputCol="features",\n    withStd=True,\n    withMean=True\n)\n\n# Fit scaler on training data\nscaler_model = scaler.fit(train_assembled)\ntrain_scaled = scaler_model.transform(train_assembled)\ntest_scaled = scaler_model.transform(test_assembled)\n\n# Cache the final processed datasets\ntrain_scaled.cache()\ntest_scaled.cache()\n\n\nprint(f"✅ Feature scaling completed:")\nprint(f"   📊 Features standardized (mean=0, std=1)")\nprint(f"   🔧 Scaler fitted on training data only")\n'

## Final Dataset Preparation

In [16]:

print("\n📋 Step 4: Preparing final ML datasets...")

# Select columns needed for modeling
ml_columns = ["ICUSTAY_ID", "ICU_LOS_DAYS", "features"]

#train_final = train_scaled.select(ml_columns).withColumnRenamed("ICU_LOS_DAYS", "label")
#test_final = test_scaled.select(ml_columns).withColumnRenamed("ICU_LOS_DAYS", "label")

#train_final = train_assembled.select(ml_columns).withColumnRenamed("ICU_LOS_DAYS", "label")
#test_final = test_assembled.select(ml_columns).withColumnRenamed("ICU_LOS_DAYS", "label")


train_final = train_assembled.select("ICUSTAY_ID", "ICU_LOS_DAYS", "features_raw") \
    .withColumnRenamed("ICU_LOS_DAYS", "label") \
    .withColumnRenamed("features_raw", "features")

test_final = test_assembled.select("ICUSTAY_ID", "ICU_LOS_DAYS", "features_raw") \
    .withColumnRenamed("ICU_LOS_DAYS", "label") \
    .withColumnRenamed("features_raw", "features")




print("\n📋 Caching...")

# Cache final datasets
train_final.cache()
test_final.cache()

print(f"✅ Final ML datasets prepared:")
print(f"   🎯 Target variable: 'label' (ICU_LOS_DAYS)")
print(f"   📊 Features: 'features' (scaled vector)")
print(f"   🔑 Identifier: 'ICUSTAY_ID'")

# Show sample of final datasets
#print(f"\n📋 Sample of training data structure:")
#train_final.select("ICUSTAY_ID", "label").show(5)

#print(f"\n📋 Feature vector example (first 10 features):")
# Show first few elements of feature vector for one sample
#sample_features = train_final.select("features").take(1)[0]["features"]
#print(f"   📊 Feature vector sample: {sample_features.toArray()[:10]}...")
#print(f"   📏 Total feature dimensions: {len(sample_features.toArray())}")

# ============================================================================
# DATA QUALITY CHECKS
# ============================================================================
'''
print(f"\n🔍 Step 5: Final data quality checks...")

# Check for any remaining nulls
train_nulls = train_final.filter(col("label").isNull() | col("features").isNull()).count()
test_nulls = test_final.filter(col("label").isNull() | col("features").isNull()).count()

print(f"   🔍 Null values in training set: {train_nulls}")
print(f"   🔍 Null values in test set: {test_nulls}")

# Show target variable ranges
train_stats = train_final.agg(
    min("label").alias("min_los"),
    max("label").alias("max_los"), 
    avg("label").alias("mean_los"),
    stddev("label").alias("std_los")
).collect()[0]

print(f"\n📊 Final training set target statistics:")
print(f"   📉 Min LOS: {train_stats['min_los']:.2f} days")
print(f"   📈 Max LOS: {train_stats['max_los']:.2f} days") 
print(f"   📊 Mean LOS: {train_stats['mean_los']:.2f} days")
print(f"   📏 Std LOS: {train_stats['std_los']:.2f} days")

print(f"\n✅ Data preprocessing completed successfully!")
print(f"🚀 Ready for model training with {len(feature_columns)} features")
'''

print(f"⏰ Preprocessing completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


📋 Step 4: Preparing final ML datasets...

📋 Caching...
✅ Final ML datasets prepared:
   🎯 Target variable: 'label' (ICU_LOS_DAYS)
   📊 Features: 'features' (scaled vector)
   🔑 Identifier: 'ICUSTAY_ID'
⏰ Preprocessing completed at: 2025-06-01 11:07:12


## Training Multiple Models

In [17]:
print("📊 Step 1: Setting up evaluation metrics...")

# Create regression evaluators
rmse_evaluator = RegressionEvaluator(
    labelCol="label", 
    predictionCol="prediction", 
    metricName="rmse"
)

mae_evaluator = RegressionEvaluator(
    labelCol="label",
    predictionCol="prediction", 
    metricName="mae"
)

r2_evaluator = RegressionEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="r2"
)

print("✅ Evaluation metrics configured: RMSE, MAE, R²")

📊 Step 1: Setting up evaluation metrics...
✅ Evaluation metrics configured: RMSE, MAE, R²


### Linear Regression

In [18]:
print("\n📈 Step 2: Training Linear Regression model...")
print(f"🕐 Started at: {datetime.now().strftime('%H:%M:%S')}")
start_time = time.time()

# Create Linear Regression model
lr = LinearRegression(
   featuresCol="features",
   labelCol="label",
   maxIter=100,
   regParam=0.01,
   elasticNetParam=0.0,
   tol=1e-6,
   standardization=True,
   fitIntercept=True
)


# Train the model
print("   🔄 Training Linear Regression...")
lr_model = lr.fit(train_final)

print("   🔄 Linear Regression - Making predictions (test data)...")
lr_predictions = lr_model.transform(test_final)

print("   🔄 Linear Regression - Evaluation...")
lr_rmse = rmse_evaluator.evaluate(lr_predictions)
lr_mae = mae_evaluator.evaluate(lr_predictions)
lr_r2 = r2_evaluator.evaluate(lr_predictions)

print(f"✅ Linear Regression Results:")
print(f"   📉 RMSE: {lr_rmse:.3f} days")
print(f"   📊 MAE: {lr_mae:.3f} days")
print(f"   📈 R²: {lr_r2:.3f}")

end_time = time.time()
elapsed_time = end_time - start_time
print(f"🕐 Completed at: {datetime.now().strftime('%H:%M:%S')}")
print(f"⏱️ Total elapsed time: {elapsed_time:.2f} seconds")


# Linear Regression Predictions

print("\n📈 Linear Regression Predictions (Sample 20):")
lr_display = lr_predictions.select(
    "ICUSTAY_ID",
    col("label").alias("Actual_LOS"),
    round(col("prediction"), 3).alias("Predicted_LOS"),
    round(abs(col("label") - col("prediction")), 3).alias("Absolute_Error"),
    round(((abs(col("label") - col("prediction")) / col("label")) * 100), 2).alias("Percent_Error")
).orderBy("ICUSTAY_ID")

lr_display.show(20, truncate=False)


📈 Step 2: Training Linear Regression model...
🕐 Started at: 11:07:13
   🔄 Training Linear Regression...


                                                                                

   🔄 Linear Regression - Making predictions (test data)...
   🔄 Linear Regression - Evaluation...


                                                                                

✅ Linear Regression Results:
   📉 RMSE: 2.841 days
   📊 MAE: 2.778 days
   📈 R²: -272.669
🕐 Completed at: 11:07:42
⏱️ Total elapsed time: 29.94 seconds

📈 Linear Regression Predictions (Sample 20):




+----------+----------+-------------+--------------+-------------+
|ICUSTAY_ID|Actual_LOS|Predicted_LOS|Absolute_Error|Percent_Error|
+----------+----------+-------------+--------------+-------------+
|231977    |0.9792    |4.254        |3.275         |334.45       |
|252713    |0.848     |3.966        |3.118         |367.67       |
|298190    |1.2597    |3.2          |1.94          |154.01       |
+----------+----------+-------------+--------------+-------------+





### Random Forest

In [19]:

print("\n🌲 Step 3: Training Random Forest model...")
print(f"🕐 Started at: {datetime.now().strftime('%H:%M:%S')}")
start_time = time.time()

# Create Random Forest model
rf = RandomForestRegressor(
    featuresCol="features",
    labelCol="label",
    numTrees=100,                   # Standard ensemble size
    maxDepth=8,                     # Good depth for complex patterns
    minInstancesPerNode=5,          # Allow granular splits
    subsamplingRate=0.8,            # 80% data sampling
    featureSubsetStrategy="sqrt",   # sqrt(num_features) per split
    maxBins=32,                     # Standard binning
    impurity="variance",            # For regression
    maxMemoryInMB=256,              # Standard memory allocation
    cacheNodeIds=True,              # Cache for performance
    checkpointInterval=10,          # Checkpoint every 10 iterations
    seed=42
)

print("   🔄 Training Random Forest...")
rf_model = rf.fit(train_final)

print("   🔄 Random Forest - Making predictions (test data)...")
rf_predictions = rf_model.transform(test_final)

print("   🔄 Random Forest - Evaluation...")
rf_rmse = rmse_evaluator.evaluate(rf_predictions)
rf_mae = mae_evaluator.evaluate(rf_predictions)
rf_r2 = r2_evaluator.evaluate(rf_predictions)

print(f"✅ Random Forest Results:")
print(f"   📉 RMSE: {rf_rmse:.3f} days")
print(f"   📊 MAE: {rf_mae:.3f} days")
print(f"   📈 R²: {rf_r2:.3f}")


end_time = time.time()
elapsed_time = end_time - start_time
print(f"🕐 Completed at: {datetime.now().strftime('%H:%M:%S')}")
print(f"⏱️ Total elapsed time: {elapsed_time:.2f} seconds")


# Random Forest Predictions
print("\n🌲 Random Forest Predictions (Sample 20):")
rf_display = rf_predictions.select(
    "ICUSTAY_ID",
    col("label").alias("Actual_LOS"),
    round(col("prediction"), 3).alias("Predicted_LOS"),
    round(abs(col("label") - col("prediction")), 3).alias("Absolute_Error"),
    round(((abs(col("label") - col("prediction")) / col("label")) * 100), 2).alias("Percent_Error")
).orderBy("ICUSTAY_ID")

rf_display.show(20, truncate=False)



🌲 Step 3: Training Random Forest model...
🕐 Started at: 11:07:45
   🔄 Training Random Forest...


25/06/01 11:07:46 WARN DecisionTreeMetadata: DecisionTree reducing maxBins from 32 to 13 (= number of training instances)


   🔄 Random Forest - Making predictions (test data)...
   🔄 Random Forest - Evaluation...


                                                                                

✅ Random Forest Results:
   📉 RMSE: 1.597 days
   📊 MAE: 1.588 days
   📈 R²: -85.532
🕐 Completed at: 11:08:06
⏱️ Total elapsed time: 21.26 seconds

🌲 Random Forest Predictions (Sample 20):




+----------+----------+-------------+--------------+-------------+
|ICUSTAY_ID|Actual_LOS|Predicted_LOS|Absolute_Error|Percent_Error|
+----------+----------+-------------+--------------+-------------+
|231977    |0.9792    |2.507        |1.527         |155.98       |
|252713    |0.848     |2.675        |1.827         |215.5        |
|298190    |1.2597    |2.668        |1.408         |111.78       |
+----------+----------+-------------+--------------+-------------+



                                                                                

## Model Comparison

In [20]:
print("\n🏆 Step 5: Model Performance Comparison...")

# Create comparison summary
results_data = [
    ("Linear Regression", lr_rmse, lr_mae, lr_r2),
    ("Random Forest", rf_rmse, rf_mae, rf_r2)
]

results_df = spark.createDataFrame(results_data, ["Model", "RMSE", "MAE", "R2"])

print("📊 Model Performance Summary:")
results_df.show(truncate=False)

# Find best model
import operator
import builtins
best_rmse_model = builtins.min(results_data, key=operator.itemgetter(1))
best_r2_model = builtins.max(results_data, key=operator.itemgetter(3))

print(f"\n🥇 Best Models:")
print(f"   🎯 Lowest RMSE: {best_rmse_model[0]} ({best_rmse_model[1]:.3f} days)")
print(f"   📈 Highest R²: {best_r2_model[0]} ({best_r2_model[3]:.3f})")


🏆 Step 5: Model Performance Comparison...
📊 Model Performance Summary:




+-----------------+------------------+------------------+------------------+
|Model            |RMSE              |MAE               |R2                |
+-----------------+------------------+------------------+------------------+
|Linear Regression|2.8407590201167916|2.777594641355016 |-272.669119278364 |
|Random Forest    |1.59739046150588  |1.5876170346681093|-85.53248573702092|
+-----------------+------------------+------------------+------------------+


🥇 Best Models:
   🎯 Lowest RMSE: Random Forest (1.597 days)
   📈 Highest R²: Random Forest (-85.532)


                                                                                

## Display Predictions

In [21]:

# Linear Regression Predictions

print("\n📈 Linear Regression Predictions (Sample 20):")
lr_display = lr_predictions.select(
    "ICUSTAY_ID",
    col("label").alias("Actual_LOS"),
    round(col("prediction"), 3).alias("Predicted_LOS"),
    round(abs(col("label") - col("prediction")), 3).alias("Absolute_Error"),
    round(((abs(col("label") - col("prediction")) / col("label")) * 100), 2).alias("Percent_Error")
).orderBy("ICUSTAY_ID")

lr_display.show(20, truncate=False)



# Random Forest Predictions
print("\n🌲 Random Forest Predictions (Sample 20):")
rf_display = rf_predictions.select(
    "ICUSTAY_ID",
    col("label").alias("Actual_LOS"),
    round(col("prediction"), 3).alias("Predicted_LOS"),
    round(abs(col("label") - col("prediction")), 3).alias("Absolute_Error"),
    round(((abs(col("label") - col("prediction")) / col("label")) * 100), 2).alias("Percent_Error")
).orderBy("ICUSTAY_ID")

rf_display.show(20, truncate=False)


📈 Linear Regression Predictions (Sample 20):


                                                                                

+----------+----------+-------------+--------------+-------------+
|ICUSTAY_ID|Actual_LOS|Predicted_LOS|Absolute_Error|Percent_Error|
+----------+----------+-------------+--------------+-------------+
|231977    |0.9792    |4.254        |3.275         |334.45       |
|252713    |0.848     |3.966        |3.118         |367.67       |
|298190    |1.2597    |3.2          |1.94          |154.01       |
+----------+----------+-------------+--------------+-------------+


🌲 Random Forest Predictions (Sample 20):
+----------+----------+-------------+--------------+-------------+
|ICUSTAY_ID|Actual_LOS|Predicted_LOS|Absolute_Error|Percent_Error|
+----------+----------+-------------+--------------+-------------+
|231977    |0.9792    |2.507        |1.527         |155.98       |
|252713    |0.848     |2.675        |1.827         |215.5        |
|298190    |1.2597    |2.668        |1.408         |111.78       |
+----------+----------+-------------+--------------+-------------+



