# Project notes
## Overview
This challenge explores equity in chronic condition management by predicting 30-day readmission risk from simulated electronic health record (EHR) data. Your goal is to build a predictive model that can help identify disparities in access and outcomes.

The data was generated with [Synthea](https://github.com/synthetichealth/synthea) and simulates Arizona patients.

## Task

You are given a dataset of synthetic healthcare encounters including patient demographics, conditions, medications, procedures, and payer type.

Your task is to predict whether a patient will be readmitted within 30 days of a previous encounter.

Target Column: readmitted_within_30_days (binary: 0 or 1)

Note from Instructor: 
- You should submit the probability of readmission, not just a binary 0/1 prediction.
- The competition is evaluated using ROC AUC, which measures how well your scores rank readmitted encounters above non-readmitted ones. It therefore needs continuous probability scores (e.g., 0.02, 0.76, 0.51) rather than hard 0 or 1 labels.
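
To illustrate the instructor's note, here is a minimal sketch of how ROC AUC reacts to probabilities versus hard labels. The labels and scores are toy values invented for this example, not taken from the competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

# AUC computed from the continuous scores
auc_proba = roc_auc_score(y_true, proba)

# Thresholding at 0.5 throws away the ordering on each side of the cut
hard = (proba >= 0.5).astype(int)
auc_hard = roc_auc_score(y_true, hard)

# A strictly increasing transform keeps the ranking, so AUC is unchanged
auc_scaled = roc_auc_score(y_true, proba ** 3)

print(f"probabilities: {auc_proba:.3f}")
print(f"hard labels:   {auc_hard:.3f}")
print(f"cubed scores:  {auc_scaled:.3f}")
```

In this toy case the hard labels score lower than the probabilities because ties between a positive and a negative count as half-correct, while the cubed scores match the original AUC because only the ordering matters.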

In [182]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
# accuracy_score and classification_report are used in the baseline evaluation cell
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('reference_data/dev.csv')
df.head()

Unnamed: 0,encounter_id,patient_id,age,gender,race,ethnicity,zip,payer_type,has_chronic_pain,has_hypertension,...,has_asthma,has_depression,num_meds,total_med_cost,num_procedures,total_proc_cost,pain_score,height_cm,encounter_cost,readmitted_within_30_days
0,739d0db1-fb93-6999-f85c-c884db3abfef,1a1da106-d526-aad0-37cf-0943f2071253,58,F,white,nonhispanic,85029,PRIVATE,0,0,...,0,0,,,,,,,78.21,0
1,e29e9cdb-37dc-79fe-25ea-fd2e53463c34,9ac8a441-cace-f832-4adf-a2eceaf06333,41,M,white,nonhispanic,85018,GOVERNMENT,0,0,...,0,0,2.0,738.23,7.0,4264.75,1.0,181.6,78.21,1
2,ac8ed32d-fbb6-5759-f7f5-8f109a15a728,2a4c4143-f877-322d-2081-785d8150ba2b,87,F,white,hispanic,85345,GOVERNMENT,0,0,...,0,0,,,,,,,130.36,0
3,8e355cec-2541-c70a-3606-570f2028fac8,893afda8-3e4a-3568-a1c9-4a75571e6689,80,M,white,hispanic,85305,GOVERNMENT,0,0,...,0,0,,,1.0,609.25,,,78.21,1
4,1dd3fc7a-60bd-2857-415f-567fee10909f,39d5a6ca-39b9-7516-a9f0-5d8afc8132dc,13,F,white,nonhispanic,85302,PRIVATE,0,0,...,0,0,1.0,305.92,7.0,4264.75,,,78.21,0


In [183]:
# prep data for ML
# get rid of fields that will not help with predictions
df = df.drop(['encounter_id', 'patient_id'], axis=1)
display(df.info())

X = df.drop('readmitted_within_30_days', axis=1)
y = df['readmitted_within_30_days']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Split columns by dtype for preprocessing: object columns are categorical, the rest numerical
# Note: zip is stored as int64, so it ends up in numerical_cols
categorical_cols = [col for col in X.columns if X[col].dtype == 'object']
numerical_cols = [col for col in X.columns if col not in categorical_cols]

categorical_cols

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125958 entries, 0 to 125957
Data columns (total 19 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   age                        125958 non-null  int64  
 1   gender                     125958 non-null  object 
 2   race                       125958 non-null  object 
 3   ethnicity                  125958 non-null  object 
 4   zip                        125958 non-null  int64  
 5   payer_type                 125958 non-null  object 
 6   has_chronic_pain           125958 non-null  int64  
 7   has_hypertension           125958 non-null  int64  
 8   has_diabetes               125958 non-null  int64  
 9   has_asthma                 125958 non-null  int64  
 10  has_depression             125958 non-null  int64  
 11  num_meds                   62529 non-null   float64
 12  total_med_cost             62529 non-null   float64
 13  num_procedures             93

None

['gender', 'race', 'ethnicity', 'payer_type']

In [184]:
# create pipeline
# process categoricals
categorical_transformer = Pipeline(
    steps=[
        ("constant", SimpleImputer(strategy='constant', fill_value='missing')),
        ("onehot", OneHotEncoder(handle_unknown='ignore'))
    ]
)

# process numerical
numerical_transformer = Pipeline(
    steps=[
        ("mean", SimpleImputer(strategy="mean")),
        ("standard", StandardScaler())
    ]
)

# combine for pipeline
preprocessing_pipeline = ColumnTransformer(
    transformers=[
        ("categorical", categorical_transformer, categorical_cols),
        ("numerical", numerical_transformer, numerical_cols)
    ]
)

# put together pipeline
ml_pipeline = Pipeline([
    ("preprocessing", preprocessing_pipeline),
    ("model", LogisticRegression())
])

# fit data
ml_pipeline.fit(X_train, y_train)

In [185]:
# Baseline: hard 0/1 predictions on the held-out split (X_test)
y_pred = ml_pipeline.predict(X_test)

# Evaluate the model using classification metrics
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.3f}")
print("Classification Report:")
print(report)

Model Accuracy: 0.674
Classification Report:
              precision    recall  f1-score   support

           0       0.63      0.28      0.38     11562
           1       0.68      0.91      0.78     19928

    accuracy                           0.67     31490
   macro avg       0.66      0.59      0.58     31490
weighted avg       0.66      0.67      0.63     31490
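
The report above is computed from hard 0/1 predictions, while the competition metric is ROC AUC. A minimal sketch of scoring a baseline with probabilities follows. It uses a synthetic dataset from `make_classification` as a stand-in (an assumption, so the printed number says nothing about the real data), and `demo_model` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the probability of the positive class (label 1)
y_proba = demo_model.predict_proba(X_te)[:, 1]
demo_auc = roc_auc_score(y_te, y_proba)
print(f"ROC AUC: {demo_auc:.3f}")
```

In the notebook itself, the equivalent call would presumably be `roc_auc_score(y_test, ml_pipeline.predict_proba(X_test)[:, 1])`, since the fitted pipeline exposes `predict_proba` through its final `LogisticRegression` step.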



In [186]:
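# Grid search over LogisticRegression hyperparameters
# With no `scoring` argument, GridSearchCV ranks candidates by accuracy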
display(ml_pipeline)
param_grid = {
    "model__C": [0.1, 0.5, 1.0, 10],
    "model__max_iter": [100, 500, 1000],
    "model__tol": [1e-3, 1e-5],
}
search = GridSearchCV(ml_pipeline, param_grid, n_jobs=2)
search.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Best parameter (CV score=0.672):
{'model__C': 0.5, 'model__max_iter': 100, 'model__tol': 0.001}
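
Because no `scoring` argument was passed, the CV score above is mean accuracy, not ROC AUC. One possible adjustment is to pass `scoring="roc_auc"` so the search optimizes the competition metric. The sketch below shows this on synthetic stand-in data. The dataset, the reduced grid, and the names `demo_pipeline` and `demo_search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the real encounters)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

demo_search = GridSearchCV(
    demo_pipeline,
    param_grid={"model__C": [0.1, 0.5, 1.0, 10]},
    scoring="roc_auc",  # rank candidates by ROC AUC instead of accuracy
    cv=5,
)
demo_search.fit(X_demo, y_demo)

print(f"Best CV ROC AUC: {demo_search.best_score_:.3f}")
print(demo_search.best_params_)
```

With `scoring="roc_auc"`, `best_score_` is the mean cross-validated ROC AUC of the best candidate, which makes it directly comparable to the competition metric.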


In [187]:
# TODO: evaluate model with best params and compare to previous
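
One way the TODO above might be approached is to score both the baseline and the tuned model on the same held-out split with ROC AUC. This is a sketch under stated assumptions: the data is synthetic, and `holdout_auc`, `baseline`, and `tuned_search` are hypothetical names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split


def holdout_auc(fitted_model, X_holdout, y_holdout):
    """ROC AUC of a fitted classifier on held-out data (hypothetical helper)."""
    return roc_auc_score(y_holdout, fitted_model.predict_proba(X_holdout)[:, 1])


# Synthetic stand-in data (assumption: not the real encounters)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

tuned_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 0.5, 1.0, 10]},
    scoring="roc_auc",
).fit(X_tr, y_tr)

# best_estimator_ is refit on all of X_tr with the best parameters (refit=True by default)
results = {
    "baseline": holdout_auc(baseline, X_te, y_te),
    "tuned": holdout_auc(tuned_search.best_estimator_, X_te, y_te),
}
for name, auc in results.items():
    print(f"{name}: ROC AUC = {auc:.3f}")
```

In the notebook, the same comparison would presumably use `ml_pipeline` and `search.best_estimator_` with `X_test` and `y_test` in place of the stand-ins.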
