# Tutorial 9: Descriptive Machine Learning

## Background

At Heilbronn Hospital's ICU department, the medical team is looking to develop predictive models to better estimate patient length of stay (LOS). Accurate LOS predictions are crucial for resource planning, staff scheduling, and improving patient care coordination. As a data scientist on the team, you'll use machine learning techniques to analyze historical patient data and build models that can help the hospital make more informed decisions about resource allocation and patient care management.

1. Data Preparation: To begin our predictive modeling project, import the historical ICU data from 'icu_data_example.csv'. Following best practices in machine learning, use `scikit-learn` to split the data into training and testing sets. This split will allow us to develop our model on one portion of the data and validate its performance on unseen cases, similar to how the model would perform on future ICU admissions.


In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

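# Load the data and drop the identifier column, which is not a predictor
df = pd.read_csv("icu_data_example.csv")
df = df.drop(columns=['stay_id'])

# Separate the features from the target (length of stay)
X = df.drop(columns=['los'])
y = df['los']

# Identify numeric and categorical columns by dtype
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# Mean-impute missing numeric values; one-hot encode categorical values,
# dropping the first level of each variable to avoid a redundant column
numeric_transformer = SimpleImputer(strategy='mean')
categorical_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Chain preprocessing and the regression model into a single estimator
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training rows only, so the imputation means come from training data
pipeline.fit(X_train, y_train)

# Predict on the held-out rows and score the predictions
y_pred = pipeline.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.2f}")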


Mean Squared Error (MSE): 6.48
R² Score: 0.02
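
To put these two metrics in context, the sketch below recomputes MSE, RMSE, and R² by hand with NumPy. The arrays `y_true` and `y_hat` are made-up illustrative values, not rows from the ICU data. The square root of the MSE is expressed in the same units as `los`, so the reported MSE of 6.48 corresponds to an RMSE of roughly 2.55. An R² close to 0 means the model does barely better than always predicting the mean of the test targets.

```python
import numpy as np

# Illustrative values only (assumed, not taken from the ICU dataset)
y_true = np.array([1.0, 2.0, 4.0, 7.0])
y_hat = np.array([2.0, 2.0, 3.0, 5.0])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# Root mean squared error: same units as the target
rmse = np.sqrt(mse)

# R^2 compares the model's squared error with that of a mean-only baseline
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# RMSE implied by the MSE reported in the cell output above
rmse_reported = np.sqrt(6.48)

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, R^2: {r2:.3f}")
print(f"RMSE for the reported MSE of 6.48: {rmse_reported:.2f}")
```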


2. Initial Model Development: The ICU team wants to understand how patient characteristics influence length of stay. Using `scikit-learn`, implement a simple linear regression model to predict the 'los' (length of stay) based on key patient metrics such as age and weight. This baseline model will help us identify the most significant factors affecting a patient's hospital stay duration.


In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

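df = pd.read_csv("icu_data_example.csv")

# Use only age and weight as predictors of length of stay
X = df[['age', 'weight']]
y = df['los']

# Mean-impute missing values. Note: the means are computed on the full
# dataset before the split, so test rows influence the imputed values.
X = X.fillna(X.mean())

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit ordinary least squares on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Linear Regression using age and weight")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.2f}")
print("Model Coefficients:")
print(f"  Age coefficient: {model.coef_[0]:.4f}")
print(f"  Weight coefficient: {model.coef_[1]:.4f}")
print(f"  Intercept: {model.intercept_:.4f}")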

Linear Regression using age and weight
Mean Squared Error (MSE): 6.59
R² Score: -0.00
Model Coefficients:
  Age coefficient: 0.0021
  Weight coefficient: 0.0061
  Intercept: 2.0473
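
The cell above fills missing values with means taken from the whole dataset before splitting. One possible alternative, sketched below, is to put the imputer inside a `Pipeline` so that the means are learned from the training rows only. The data here is synthetic (the `demo` frame and `los` array are hypothetical stand-ins for the ICU table), so the printed score says nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the ICU table (hypothetical values)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(18, 90, size=n).astype(float),
    "weight": rng.normal(80, 15, size=n),
})
# Blank out 20 weights to mimic missing measurements
demo.loc[rng.choice(n, size=20, replace=False), "weight"] = np.nan
los = rng.gamma(shape=2.0, scale=1.5, size=n)

# Split first, so the test rows stay unseen during imputation
X_tr, X_te, y_tr, y_te = train_test_split(demo, los, test_size=0.2, random_state=42)

# The imputer is fitted as part of the pipeline, on training rows only
leak_free = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])
leak_free.fit(X_tr, y_tr)

train_means = leak_free.named_steps["impute"].statistics_
print("Imputation means learned from training rows:", train_means)
print("Test R^2 on synthetic data:", leak_free.score(X_te, y_te))
```

With only a handful of missing values the difference from full-dataset imputation is usually small, but keeping every fitted step inside the pipeline makes the test score a cleaner estimate of performance on future admissions.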


3. Model Evaluation: To assess how well our model might perform in real-world scenarios, evaluate its predictions on the testing data using the mean squared error metric. This will give the medical team a concrete measure of the model's accuracy in predicting length of stay, helping them understand the reliability of the predictions for resource planning.


- Optional Investigation: The medical team is particularly interested in understanding which patient characteristics have the strongest influence on length of stay. How can we examine the detailed model coefficients to identify these key factors?


In [11]:
from sklearn.metrics import mean_squared_error

# Reuse y_test and y_pred from the age/weight model fitted in the previous cell
mse = mean_squared_error(y_test, y_pred)

print("Model Evaluation on Test Data")
print(f"Mean Squared Error (MSE): {mse:.2f}")

Model Evaluation on Test Data
Mean Squared Error (MSE): 6.59
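
For the optional investigation, one way to examine the coefficients is to pair `coef_` with the feature names and rank them by absolute size. Raw coefficients such as the 0.0021 (age) and 0.0061 (weight) reported earlier are per-unit effects, so they are not directly comparable across features measured on different scales. Standardizing the features first puts them on a common footing. The sketch below uses synthetic data in which the hypothetical target depends on age only; the names `demo` and `target` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): the target depends on age, not on weight
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "age": rng.normal(60, 15, size=n),
    "weight": rng.normal(80, 20, size=n),
})
target = 0.05 * demo["age"] + rng.normal(0, 1, size=n)

# Standardize so each coefficient is the effect of a one-standard-deviation change
X_std = StandardScaler().fit_transform(demo)
reg = LinearRegression().fit(X_std, target)

# Label the coefficients and rank them by absolute magnitude
coefs = pd.Series(reg.coef_, index=demo.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficient size alone does not show whether an effect is statistically distinguishable from zero; that requires standard errors or p-values.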


4. Advanced Statistical Analysis: For a more detailed statistical analysis, use `statsmodels` to fit another linear regression model. Unlike `scikit-learn`, this library reports standard errors, t-statistics, p-values, and confidence intervals for each coefficient, which the medical team can use to validate their clinical assumptions about factors affecting length of stay.


In [12]:
import pandas as pd
import statsmodels.api as sm

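df = pd.read_csv("icu_data_example.csv")

# Same two predictors as in the scikit-learn model
X = df[['age', 'weight']]
y = df['los']

# Mean-impute missing predictors (and any missing targets) so OLS receives complete data
X = X.fillna(X.mean())
y = y.fillna(y.mean())

# statsmodels does not add an intercept automatically, so add a constant column
X_with_const = sm.add_constant(X)

# Fit ordinary least squares and print the full regression table
model = sm.OLS(y, X_with_const).fit()

print(model.summary())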

                            OLS Regression Results                            
Dep. Variable:                    los   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     3.330
Date:                Sat, 12 Jul 2025   Prob (F-statistic):             0.0359
Time:                        22:14:30   Log-Likelihood:                -7080.5
No. Observations:                3000   AIC:                         1.417e+04
Df Residuals:                    2997   BIC:                         1.418e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.0323      0.271      7.503      0.0

5. Clinical Interpretation: The hospital's research team needs a detailed interpretation of how each patient factor affects length of stay. Analyze the regression coefficients, discuss their statistical significance, and evaluate the overall model fit using the R-squared value. This analysis will help clinicians understand which patient characteristics are most predictive of extended ICU stays.


In [13]:
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                    los   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     3.330
Date:                Sat, 12 Jul 2025   Prob (F-statistic):             0.0359
Time:                        22:14:30   Log-Likelihood:                -7080.5
No. Observations:                3000   AIC:                         1.417e+04
Df Residuals:                    2997   BIC:                         1.418e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.0323      0.271      7.503      0.0

6. Handling Categorical Data: The ICU dataset includes various categorical variables such as admission type and patient status. Use `pandas` or `scikit-learn` to perform one-hot encoding on these categorical variables, making them suitable for our machine learning model while preserving their clinical significance.


In [14]:
import pandas as pd

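df = pd.read_csv("icu_data_example.csv")

# Columns with dtype object hold the categorical (string) variables
categorical_cols = df.select_dtypes(include='object').columns
print("Categorical columns:", categorical_cols.tolist())

# One-hot encode; drop_first=True removes one level per variable to avoid a redundant column
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

print(df_encoded.head())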

Categorical columns: ['gender']
    stay_id        los  age     weight  Heart Rate  Arterial O2 pressure  \
0  39571806  11.578495   44  80.149706   67.416667            120.500000   
1  32660070   1.278322   74  85.000000   99.857143            110.742857   
2  38333163  10.472477   63  92.487409   77.285714             98.000000   
3  30825022   1.140208   49  84.500000   76.416667            109.800000   
4  36534848   0.661655   76  56.381486   69.166667            102.000000   

   Magnesium  HCO3 (serum)  PH (Arterial)  gender_male  
0   1.766667           NaN       7.425000         True  
1        NaN           NaN       7.440571        False  
2   2.550000          25.0       7.460000         True  
3   1.800000           NaN            NaN        False  
4   2.000000          25.0       7.404000        False  


7. Feature Standardization: Our ICU measurements (age, weight, Heart Rate, Arterial O2 pressure, Magnesium, HCO3 (serum), and PH (Arterial)) are recorded in different units and scales. Standardize these features to ensure each variable contributes appropriately to the model's predictions, preventing any single measurement from dominating the analysis due to its scale.


In [15]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("icu_data_example.csv")

# Numeric measurements recorded in different units and scales
features_to_scale = [
    'age', 'weight', 'Heart Rate', 'Arterial O2 pressure',
    'Magnesium', 'HCO3 (serum)', 'PH (Arterial)'
]

# Mean-impute missing values first, because StandardScaler leaves NaN values as NaN
df_scaled = df.copy()
df_scaled[features_to_scale] = df_scaled[features_to_scale].fillna(df_scaled[features_to_scale].mean())

# Rescale each feature to zero mean and unit variance
scaler = StandardScaler()
df_scaled[features_to_scale] = scaler.fit_transform(df_scaled[features_to_scale])

print(df_scaled[features_to_scale].head())

        age    weight  Heart Rate  Arterial O2 pressure  Magnesium  \
0 -1.162818 -0.045653   -1.050302             -0.414367  -0.917159   
1  0.622166  0.167318    0.853380             -0.597428   0.000000   
2 -0.032328  0.496083   -0.471163             -0.836506   1.721669   
3 -0.865321  0.145364   -0.522161             -0.615117  -0.804868   
4  0.741165 -1.089290   -0.947607             -0.761459  -0.131125   

   HCO3 (serum)  PH (Arterial)  
0 -1.072613e-15   8.065339e-01  
1 -1.072613e-15   1.167925e+00  
2  4.983771e-01   1.618834e+00  
3 -1.072613e-15  -2.061335e-14  
4  4.983771e-01   3.191539e-01  
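
In the output above, entries that were mean-imputed end up at (numerically) zero after scaling, for example the values near -1.07e-15 in `HCO3 (serum)`, because a value equal to the column mean standardizes to 0. The sketch below illustrates the underlying formula, z = (x - mean) / std, and shows one way to fit the scaler on training rows only before applying it to test rows. The data is synthetic and the column meanings are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic measurements on very different scales (hypothetical values)
rng = np.random.default_rng(3)
values = np.column_stack([
    rng.normal(60, 15, size=100),     # age-like column
    rng.normal(7.4, 0.05, size=100),  # pH-like column
])

train, test = train_test_split(values, test_size=0.2, random_state=42)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train)
train_z = scaler.transform(train)
test_z = scaler.transform(test)  # reuses the training statistics

# StandardScaler uses the population standard deviation (ddof=0), as does np.std by default
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print("Matches manual z-scores:", np.allclose(train_z, manual))
print("Train means after scaling:", train_z.mean(axis=0).round(6))
print("Train stds after scaling:", train_z.std(axis=0).round(6))
```

The test rows will generally not have exactly zero mean and unit variance after this transformation, which is expected: they are scaled with statistics learned from the training rows.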
