# Tutorial 9: Descriptive Machine Learning

## Background

At Heilbronn Hospital's ICU department, the medical team is looking to develop predictive models to better estimate patient length of stay (LOS). Accurate LOS predictions are crucial for resource planning, staff scheduling, and improving patient care coordination. As a data scientist on the team, you'll use machine learning techniques to analyze historical patient data and build models that can help the hospital make more informed decisions about resource allocation and patient care management.

1. Data Preparation: To begin our predictive modeling project, import the historical ICU data from 'icu_data_example.csv'. Following best practices in machine learning, use `scikit-learn` to split the data into training and testing sets. This split will allow us to develop our model on one portion of the data and validate its performance on unseen cases, similar to how the model would perform on future ICU admissions.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("icu_data_example.csv")
df = df.drop(columns=['stay_id'])

X = df.drop(columns=['los'])
y = df['los']

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

numeric_transformer = SimpleImputer(strategy='mean')
categorical_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.2f}")
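
The cell above folds the split into a full pipeline. To look at the split step on its own, here is a minimal sketch on synthetic data; the `toy` table and its values are made up for illustration and are not taken from `icu_data_example.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ICU table (hypothetical values, not real patient data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=100),
    "weight": rng.normal(80, 15, size=100),
    "los": rng.gamma(2.0, 2.0, size=100),
})

X_toy = toy.drop(columns=["los"])
y_toy = toy["los"]

# Hold out 20% of the rows; random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)

# The two sets share no rows
shared_rows = set(X_tr.index) & set(X_te.index)
print(len(shared_rows))  # 0
```

Because the test rows are never seen during fitting, metrics computed on them approximate performance on future admissions, provided future patients resemble the historical ones.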

2. Initial Model Development: The ICU team wants to understand how patient characteristics influence length of stay. Using `scikit-learn`, implement a simple linear regression model to predict the 'los' (length of stay) based on key patient metrics such as age and weight. This baseline model will help us identify the most significant factors affecting a patient's hospital stay duration.


In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("icu_data_example.csv")

X = df[['age', 'weight']]
y = df['los']

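# Split first so the imputation statistics come from the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fill missing values with the training-set means to avoid leaking test information
train_means = X_train.mean()
X_train = X_train.fillna(train_means)
X_test = X_test.fillna(train_means)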

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Linear Regression using age and weight")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.2f}")
print("Model Coefficients:")
print(f"  Age coefficient: {model.coef_[0]:.4f}")
print(f"  Weight coefficient: {model.coef_[1]:.4f}")
print(f"  Intercept: {model.intercept_:.4f}")
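
The raw coefficients printed above are expressed per unit of each feature, so their magnitudes are not directly comparable across features measured on different scales. One common way to compare them is to standardize the features before fitting. The sketch below does this on synthetic data with a known relationship; the data and the coefficient values used to generate it are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known relationship (hypothetical, for illustration only)
rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    "age": rng.uniform(20, 90, n),
    "weight": rng.uniform(50, 120, n),
})
toy["los"] = 0.05 * toy["age"] + 0.001 * toy["weight"] + rng.normal(0, 0.5, n)

features = ["age", "weight"]
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(toy[features], toy["los"])

# Each coefficient is now the change in the target per one standard deviation of the feature
coefs = pd.Series(pipe.named_steps["linearregression"].coef_, index=features)

# Rank features by absolute standardized effect
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

A large standardized coefficient indicates a strong linear association in this model, not a causal effect; correlated features can also share or swap influence.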

3. Model Evaluation: To assess how well our model might perform in real-world scenarios, evaluate its predictions on the testing data using the mean squared error metric. This will give the medical team a concrete measure of the model's accuracy in predicting length of stay, helping them understand the reliability of the predictions for resource planning.


- Optional Investigation: The medical team is particularly interested in understanding which patient characteristics have the strongest influence on length of stay. How can we examine the detailed model coefficients to identify these key factors?


In [None]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)

print("Model Evaluation on Test Data")
print(f"Mean Squared Error (MSE): {mse:.2f}")
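
An MSE value is hard to judge in isolation. A possible way to give it context is to compare it against a trivial baseline that always predicts the training mean, and to report the root mean squared error, which is in the same units as the target. The following sketch uses synthetic data; `DummyRegressor` is a standard `scikit-learn` class, while the data itself is invented.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (hypothetical values)
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(300, 2))
y_toy = 3.0 + 1.5 * X_toy[:, 0] + rng.normal(0, 1.0, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linreg = LinearRegression().fit(X_tr, y_tr)

mse_baseline = mean_squared_error(y_te, baseline.predict(X_te))
mse_model = mean_squared_error(y_te, linreg.predict(X_te))

# RMSE is the square root of MSE and shares the target's units
print(f"Baseline MSE: {mse_baseline:.2f}  RMSE: {np.sqrt(mse_baseline):.2f}")
print(f"Model MSE:    {mse_model:.2f}  RMSE: {np.sqrt(mse_model):.2f}")
```

If the model's error is not clearly below the baseline's, the chosen features carry little linear information about the target.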

4. Advanced Statistical Analysis: For a more detailed statistical analysis, use `StatsModels` to fit another linear regression model. This library will provide additional statistical insights that the medical team can use to validate their clinical assumptions about factors affecting length of stay.


In [None]:
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("icu_data_example.csv")

X = df[['age', 'weight']]
y = df['los']

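# Drop rows where the target is missing instead of imputing it, then fill predictor gaps
mask = y.notna()
X, y = X[mask], y[mask]
X = X.fillna(X.mean())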

X_with_const = sm.add_constant(X)

model = sm.OLS(y, X_with_const).fit()

print(model.summary())

5. Clinical Interpretation: The hospital's research team needs a detailed interpretation of how each patient factor affects length of stay. Analyze the regression coefficients, discuss their statistical significance, and evaluate the overall model fit using the R-squared value. This analysis will help clinicians understand which patient characteristics are most predictive of extended ICU stays.


In [None]:
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())

6. Handling Categorical Data: The ICU dataset includes various categorical variables such as admission type and patient status. Use `Pandas` or `scikit-learn` to perform one-hot encoding on these categorical variables, making them suitable for our machine learning model while preserving their clinical significance.


In [None]:
import pandas as pd

df = pd.read_csv("icu_data_example.csv")

categorical_cols = df.select_dtypes(include='object').columns
print("Categorical columns:", categorical_cols.tolist())

df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

print(df_encoded.head())

7. Feature Standardization: Our ICU measurements (age, weight, Heart Rate, Arterial O2 pressure, Magnesium, HCO3 (serum), and PH (Arterial)) are recorded in different units and scales. Standardize these features to ensure each variable contributes appropriately to the model's predictions, preventing any single measurement from dominating the analysis due to its scale.


In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("icu_data_example.csv")

features_to_scale = [
    'age', 'weight', 'Heart Rate', 'Arterial O2 pressure',
    'Magnesium', 'HCO3 (serum)', 'PH (Arterial)'
]

df_scaled = df.copy()
df_scaled[features_to_scale] = df_scaled[features_to_scale].fillna(df_scaled[features_to_scale].mean())

scaler = StandardScaler()
df_scaled[features_to_scale] = scaler.fit_transform(df_scaled[features_to_scale])

print(df_scaled[features_to_scale].head())
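
The cell above standardizes the whole table at once, which is fine for inspection. When the scaled features feed a model that is evaluated on held-out data, the scaler is usually fitted on the training rows only and then reused on the test rows. The sketch below illustrates that pattern on synthetic values (the numbers are invented).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic measurements on different scales (hypothetical values)
rng = np.random.default_rng(4)
toy = pd.DataFrame({
    "age": rng.uniform(20, 90, 200),
    "Heart Rate": rng.normal(85, 15, 200),
})

train, test = train_test_split(toy, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training rows only
test_scaled = scaler.transform(test)        # reuse those statistics on the test rows

print("Train means:", train_scaled.mean(axis=0).round(3))
print("Train stds: ", train_scaled.std(axis=0).round(3))
print("Test means: ", test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 exactly, while the test columns are only approximately centered, which is expected.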

# Homework 09

8. Comprehensive Model Development:

The hospital administration has requested a robust, production-ready model for predicting ICU length of stay. Using both `scikit-learn` and `StatsModels`, develop comprehensive linear regression models that account for all aspects of real-world medical data. Your workflow must include:

- Handling missing values appropriately, considering the clinical significance of missing data
- Identifying and managing abnormal or outlier measurements that could be equipment errors or unusual patient cases
- Encoding categorical variables effectively while preserving their medical meaning
- Addressing scale differences between various vital signs and lab measurements

To ensure maintainability and flexibility, create separate Python functions for each preprocessing step (missing value imputation, outlier handling, categorical encoding, and feature scaling). These functions should be parameterized to allow different methods to be applied based on future requirements or medical staff feedback.
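
The first requirement mentions the clinical significance of missing data. One possible reading is that the fact a measurement was not taken can itself be informative, so it may be worth keeping a flag alongside the imputed value. The sketch below uses `SimpleImputer` with `add_indicator=True` for this; the `_was_missing` column naming and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical lab values with one missing Magnesium measurement
toy = pd.DataFrame({
    "Magnesium": [2.0, np.nan, 1.8, 2.2],
    "Heart Rate": [80.0, 95.0, 70.0, 88.0],
})

# Median imputation plus a 0/1 indicator for each feature that had missing values
imputer = SimpleImputer(strategy="median", add_indicator=True)
values = imputer.fit_transform(toy)

# Indicator columns are appended only for features that had missing values during fit
flagged = [f"{toy.columns[i]}_was_missing" for i in imputer.indicator_.features_]
imputed = pd.DataFrame(values, columns=list(toy.columns) + flagged, index=toy.index)
print(imputed)
```

Whether the indicator helps depends on why values are missing, which is a question for the clinical staff rather than something the data alone can answer.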


In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

# FUNCTION 1
# Handling missing values appropriately and considering the clinical significance of missing data
def impute_missing(df, strategy='mean', columns=None):
    df_copy = df.copy()
    if columns is None:
        columns = df_copy.select_dtypes(include=[np.number]).columns.tolist()
    imputer = SimpleImputer(strategy=strategy)
    df_copy[columns] = imputer.fit_transform(df_copy[columns])
    return df_copy

# FUNCTION 2
# Identifying and managing outlier measurements
def remove_outliers(df, columns, z_thresh=3):
    df_copy = df.copy()
    for col in columns:
        if df_copy[col].std() == 0:
            continue
        z_scores = (df_copy[col] - df_copy[col].mean()) / df_copy[col].std()
        df_copy = df_copy[np.abs(z_scores) < z_thresh]
    return df_copy

# FUNCTION 3
# Encoding categorical variables effectively
def encode_categoricals(df, drop_first=True):
    df_encoded = pd.get_dummies(df, drop_first=drop_first)
    return df_encoded

# FUNCTION 4
# Addressing scale differences in features
def standardize_features(df, columns):
    scaler = StandardScaler()
    df_scaled = df.copy()
    df_scaled[columns] = scaler.fit_transform(df_scaled[columns])
    return df_scaled

# Data loading and preprocessing
df = pd.read_csv("icu_data_example.csv")

if 'stay_id' in df.columns:
    df = df.drop(columns=['stay_id'])

target_col = 'los'
numeric_cols = df.select_dtypes(include=[np.number]).drop(columns=[target_col]).columns.tolist()
categorical_cols = df.select_dtypes(include='object').columns.tolist()

# Step-by-step preprocessing
df = impute_missing(df, strategy='mean', columns=numeric_cols)
df = remove_outliers(df, columns=numeric_cols)
df = encode_categoricals(df, drop_first=True)
df = standardize_features(df, columns=numeric_cols)

# Final x & y values
X = df.drop(columns=[target_col])
y = df[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# EXAMPLE: Training a linear regression model:
model_sklearn = LinearRegression()
model_sklearn.fit(X_train, y_train)

y_pred = model_sklearn.predict(X_test)

# Eval
print("Test case: linear regression")
print(f"  ➤ MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"  ➤ R²: {r2_score(y_test, y_pred):.2f}")
print("  ➤ Coefficients:")
for feature, coef in zip(X.columns, model_sklearn.coef_):
    print(f"     {feature}: {coef:.4f}")
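
The task asks for preprocessing functions that are parameterized so that different methods can be swapped in. As one possible extension of the z-score function above, the sketch below adds an IQR rule and lets the caller choose between clipping and removing flagged values. The function name `handle_outliers` and its parameters are hypothetical, not part of the original solution.

```python
import pandas as pd


def handle_outliers(df, columns, method="iqr", factor=1.5, action="clip"):
    """Hypothetical helper: bound each column by an IQR or z-score rule, then clip or drop."""
    out = df.copy()
    keep = pd.Series(True, index=out.index)
    for col in columns:
        if method == "iqr":
            q1, q3 = out[col].quantile([0.25, 0.75])
            lower, upper = q1 - factor * (q3 - q1), q3 + factor * (q3 - q1)
        elif method == "zscore":
            mean, std = out[col].mean(), out[col].std()
            lower, upper = mean - factor * std, mean + factor * std
        else:
            raise ValueError(f"Unknown method: {method}")

        if action == "clip":
            # Pull extreme values back to the nearest bound, keeping every row
            out[col] = out[col].clip(lower, upper)
        elif action == "remove":
            # Rows outside the bounds (or with missing values) are marked for removal
            keep &= out[col].between(lower, upper)
        else:
            raise ValueError(f"Unknown action: {action}")
    return out[keep] if action == "remove" else out


# Hypothetical heart-rate readings with one implausible value
toy = pd.DataFrame({"Heart Rate": [70.0, 72.0, 75.0, 78.0, 80.0, 82.0, 85.0, 300.0]})
clipped = handle_outliers(toy, ["Heart Rate"], method="iqr", action="clip")
removed = handle_outliers(toy, ["Heart Rate"], method="iqr", action="remove")

print(clipped["Heart Rate"].max())  # 95.5
print(len(removed))                 # 7
```

Clipping keeps the patient in the dataset, which may matter when an extreme reading belongs to a genuinely unusual case; removal is more appropriate when the value is believed to be an equipment error.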