Predicting length of hospital stay

**Goal:** Build regression models to predict `time_in_hospital` and assess the factors that influence length of stay.

1. Baseline: naive median prediction, evaluated with MAE and RMSE
2. LinearRegression, Ridge, Lasso: training and test-set evaluation
3. Feature selection and regularization: coefficient analysis, removal of irrelevant features
4. Hyperparameter tuning for the best model
5. Interpretation of results and conclusions

Cell 1. Imports and data loading

In [30]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("../data/processed/df_clean.csv")
df.head()

Unnamed: 0,race,gender,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,...,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted,age_num
0,Caucasian,Female,6,25,1,1,Unknown,Pediatrics-Endocrinology,41,0,...,No,No,No,No,No,No,No,No,NO,5.0
1,Caucasian,Female,1,1,7,3,Unknown,Unknown,59,0,...,Up,No,No,No,No,No,Ch,Yes,>30,15.0
2,AfricanAmerican,Female,1,1,7,2,Unknown,Unknown,11,5,...,No,No,No,No,No,No,No,Yes,NO,25.0
3,Caucasian,Male,1,1,7,2,Unknown,Unknown,44,1,...,Up,No,No,No,No,No,Ch,Yes,NO,35.0
4,Caucasian,Male,1,1,7,1,Unknown,Unknown,51,0,...,Steady,No,No,No,No,No,Ch,Yes,NO,45.0


Cell 2. Building X and y and splitting into train/test

In [31]:
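# Separate the target; `readmitted` is dropped from the features as well.
# Note: the saved index column "Unnamed: 0" is not dropped and stays in X.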
X = df.drop(columns=["time_in_hospital", "readmitted"])
y = df["time_in_hospital"]

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")

Train shape: (71236, 43), Test shape: (30530, 43)
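
Step 1 of the plan (the median baseline) has no cell of its own in this notebook. The sketch below shows one way it could be computed with `DummyRegressor`. The arrays `y_train_demo` and `y_test_demo` are synthetic stand-ins, and the 1 to 14 day range is an assumption made only for the demo; in the notebook they would be replaced by `y_train` and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in for y_train / y_test (hypothetical data, assumed range 1..14 days).
rng = np.random.default_rng(42)
y_train_demo = rng.integers(1, 15, size=700)
y_test_demo = rng.integers(1, 15, size=300)

# DummyRegressor ignores the features and always predicts the training median,
# so a placeholder feature matrix of zeros is enough.
baseline = DummyRegressor(strategy="median")
baseline.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
y_pred_base = baseline.predict(np.zeros((len(y_test_demo), 1)))

mae_base = mean_absolute_error(y_test_demo, y_pred_base)
rmse_base = np.sqrt(mean_squared_error(y_test_demo, y_pred_base))
print(f"Baseline (median) MAE: {mae_base:.3f}, RMSE: {rmse_base:.3f}")
```

Any model in the later cells should beat these two numbers to be considered useful.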


Cell 3. Feature definitions and preprocessor

In [32]:
# 1) Identify numeric and categorical columns
num_cols = df.select_dtypes(include=["int64", "float64"]).columns.drop("time_in_hospital").tolist()
cat_cols = [c for c in df.columns if c not in num_cols + ["time_in_hospital", "readmitted"]]

# 2) Collect all unique categories for each categorical feature
categories = [
    np.sort(df[col].dropna().unique()) for col in cat_cols
]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(
        categories=categories,
        handle_unknown="ignore",
        sparse_output=True
    ), cat_cols),
])

Cell 4. Building the Pipeline and training the model

In [33]:
# Pipeline: preprocessing + LinearRegression
pipe_lr = Pipeline([
    ("prep", preprocessor),
    ("lr",  LinearRegression())
])

# Fit on the training set
pipe_lr.fit(X_train, y_train)

[Output: fitted Pipeline. Steps: 'prep' = ColumnTransformer ('num': StandardScaler; 'cat': OneHotEncoder with handle_unknown='ignore', sparse_output=True; remainder='drop'), 'lr' = LinearRegression (fit_intercept=True).]


Cell 5. Evaluating model quality

In [34]:
# Predictions on the test set
y_pred_lr = pipe_lr.predict(X_test)

# Metrics
mae_lr  = mean_absolute_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
print(f"LinearRegression MAE: {mae_lr:.3f}, RMSE: {rmse_lr:.3f}")

LinearRegression MAE: 1.721, RMSE: 2.278


Cell 6. Ridge and Lasso: training and metrics

In [35]:
# 1) Ridge
pipe_ridge = Pipeline([
    ("prep", preprocessor),
    ("model", Ridge(alpha=1.0, random_state=42))
])
pipe_ridge.fit(X_train, y_train)
y_pred_ridge = pipe_ridge.predict(X_test)
mae_ridge  = mean_absolute_error(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
print(f"Ridge (α=1)   — MAE: {mae_ridge:.3f}, RMSE: {rmse_ridge:.3f}")

# 2) Lasso
pipe_lasso = Pipeline([
    ("prep", preprocessor),
    ("model", Lasso(alpha=0.1, random_state=42, max_iter=5000))
])
pipe_lasso.fit(X_train, y_train)
y_pred_lasso = pipe_lasso.predict(X_test)
mae_lasso  = mean_absolute_error(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
print(f"Lasso (α=0.1) — MAE: {mae_lasso:.3f}, RMSE: {rmse_lasso:.3f}")

Ridge (α=1)   — MAE: 1.714, RMSE: 2.267
Lasso (α=0.1) — MAE: 1.932, RMSE: 2.533
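
Step 3 of the plan mentions coefficient analysis. A first check is how many coefficients each regularizer drives to exactly zero. The sketch below uses synthetic data from `make_regression` (hypothetical, not the hospital dataset) and `alpha=1.0` for both models, so the counts say nothing about the notebook's own models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (hypothetical): 20 features, only 5 of them carry signal.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=10.0, random_state=42
)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=1.0, max_iter=5000).fit(X_demo, y_demo)

# L2 shrinks coefficients but does not zero them; L1 can set them to exactly 0.
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
print(f"Zero coefficients: Ridge = {n_zero_ridge}, Lasso = {n_zero_lasso}")
```

In the notebook the fitted coefficients would be available as `pipe_lasso.named_steps["model"].coef_` and `pipe_ridge.named_steps["model"].coef_`.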


Cell 7. Tuning `alpha` for Ridge and Lasso

In [36]:
# 1) Shared alpha grid
param_grid = {'model__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# 2) Ridge with GridSearchCV
pipe_ridge = Pipeline([
    ("prep", preprocessor),
    ("model", Ridge(random_state=42))
])
grid_ridge = GridSearchCV(
    pipe_ridge,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
    n_jobs=-1,
    verbose=1
)
grid_ridge.fit(X_train, y_train)
best_ridge = grid_ridge.best_estimator_
best_ridge_alpha = grid_ridge.best_params_['model__alpha']
best_ridge_mae = -grid_ridge.best_score_

# 3) Lasso with GridSearchCV
pipe_lasso = Pipeline([
    ("prep", preprocessor),
    ("model", Lasso(random_state=42, max_iter=5000))
])
grid_lasso = GridSearchCV(
    pipe_lasso,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
    n_jobs=-1,
    verbose=1
)
grid_lasso.fit(X_train, y_train)
best_lasso = grid_lasso.best_estimator_
best_lasso_alpha = grid_lasso.best_params_['model__alpha']
best_lasso_mae = -grid_lasso.best_score_

# 4) Print the best alpha and CV MAE
print(f"Ridge: best alpha = {best_ridge_alpha}, CV MAE = {best_ridge_mae:.3f}")
print(f"Lasso: best alpha = {best_lasso_alpha}, CV MAE = {best_lasso_mae:.3f}")

Fitting 3 folds for each of 6 candidates, totalling 18 fits
Fitting 3 folds for each of 6 candidates, totalling 18 fits
Ridge: best alpha = 10, CV MAE = 1.711
Lasso: best alpha = 0.001, CV MAE = 1.724
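
Cell 7 prints only the cross-validated MAE. Test-set metrics for the tuned models are not printed anywhere. A minimal sketch of how they could be obtained from `best_estimator_` follows, using synthetic data and a simplified pipeline (both are assumptions for the demo). In the notebook the same calls would apply to `best_ridge`, `best_lasso`, `X_test` and `y_test`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical stand-in for the hospital dataset).
X_demo, y_demo = make_regression(n_samples=600, n_features=15, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

pipe_demo = Pipeline([("prep", StandardScaler()), ("model", Ridge())])
grid_demo = GridSearchCV(
    pipe_demo,
    {"model__alpha": [0.001, 0.01, 0.1, 1, 10, 100]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
grid_demo.fit(X_tr, y_tr)

# best_estimator_ is refit on the full training set, so it can predict directly.
y_pred_demo = grid_demo.best_estimator_.predict(X_te)
cv_mae = -grid_demo.best_score_
test_mae = mean_absolute_error(y_te, y_pred_demo)
test_rmse = np.sqrt(mean_squared_error(y_te, y_pred_demo))
print(f"CV MAE: {cv_mae:.3f}, test MAE: {test_mae:.3f}, test RMSE: {test_rmse:.3f}")
```

Reporting CV and test metrics side by side makes it clear which numbers were used for model selection and which measure generalization.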


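Summary: regression for length of hospital stay

After tuning the regularization strength, the results are:

| Model                         | MAE (days) | RMSE (days) |
|-------------------------------|-----------:|------------:|
| Baseline (median)             | 1.722      | 2.279       |
| LinearRegression              | 1.721      | 2.278       |
| Ridge (α=1)                   | 1.714      | 2.267       |
| Ridge tuned (α=10)            | 1.711      | 2.265       |
| Lasso (α=0.1)                 | 1.932      | 2.533       |
| Lasso tuned (α=0.001)         | 1.724      | 2.437       |

The MAE values for the tuned models are the 3-fold CV scores printed in Cell 7; the LinearRegression, Ridge (α=1) and Lasso (α=0.1) rows are test-set metrics from Cells 5 and 6.

- **Ridge (α=10)** gave the best result: MAE ≈ 1.71 days, RMSE ≈ 2.27 days. All Ridge and LinearRegression variants lie within about 0.01 day of each other in MAE.
- **Lasso**: after tuning, MAE dropped from ≈ 1.93 (α=0.1) to ≈ 1.72, but it still trails Ridge. The selected α=0.001 is the smallest value in the grid.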