# Diabetes Readmission — Capstone Report

**TL;DR:** State the question, data, method, result.

## 1. Data & Setup
- Dataset: add link & license
- Task: define target
- Ethics: PHI privacy, fairness caveats

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

In [None]:
csv_path = '../input/diabetes/diabetes.csv'  # update to your Kaggle dataset path
try:
    df = pd.read_csv(csv_path)
except Exception as e:
    print('Update csv_path or attach a Kaggle Dataset. Error:', e)
    df = None

df.head() if df is not None else None

### If using the UCI Diabetes 130-US Hospitals dataset, map `readmitted` → `target` (1 for `<30`, 0 otherwise)

In [None]:
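if df is not None and 'readmitted' in df.columns:
    # UCI data: `readmitted` is '<30', '>30' or 'NO'; only '<30' counts as positive.
    df['target'] = (df['readmitted'] == '<30').astype(int)
elif df is not None and 'Outcome' in df.columns:
    # Datasets that already provide a binary `Outcome` column.
    df['target'] = df['Outcome'].astype(int)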
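
Before splitting, it can help to confirm that the mapping produced the labels you expect and to note the positive rate, since a rare positive class changes how the metrics should be read. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real data) to show the check.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({'readmitted': ['NO', '<30', '>30', 'NO', '<30', 'NO', '>30', 'NO']})

# Same mapping as in the notebook: only '<30' is the positive class.
toy['target'] = (toy['readmitted'] == '<30').astype(int)

# Share of each class after binarization, plus the positive rate on its own.
balance = toy['target'].value_counts(normalize=True).sort_index()
positive_rate = float(toy['target'].mean())

print(balance)
print(f'positive rate: {positive_rate:.3f}')
```

On the real frame, the same two lines applied to `df['target']` give the prevalence to report in the Data & Setup section.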

## 2. Split

In [None]:
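if df is not None:
    # Prefer the binary `target` column built above; otherwise fall back to the last column.
    target = 'target' if 'target' in df.columns else df.columns[-1]
    y = df[target]
    # Drop the label and its source columns so the features cannot leak the outcome.
    X = df.drop(columns=[target, 'readmitted', 'Outcome'], errors='ignore')
    # Stratify only when the label looks categorical (few distinct values).
    strat = y if y.nunique() <= 20 else None
    # 20% test, then 25% of the remaining 80% for validation: 60/20/20 overall.
    X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=strat)
    strat_temp = y_temp if y_temp.nunique() <= 20 else None
    X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=strat_temp)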
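
The two-stage split holds out 20% for test and then takes 25% of the remaining 80% for validation, which works out to 60/20/20. A minimal sketch on made-up data (`X_toy` and `y_toy` are hypothetical) shows how to confirm the sizes and that stratification keeps the positive rate the same in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% positives.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

# Same two-stage split as in the notebook.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp)

# Row counts and positive rate per part.
sizes = {'train': len(y_tr), 'valid': len(y_va), 'test': len(y_te)}
rates = {'train': float(y_tr.mean()), 'valid': float(y_va.mean()), 'test': float(y_te.mean())}

print(sizes)
print(rates)
```

Running the same size and rate check on `y_train`, `y_valid`, and `y_test` is a cheap guard against an accidental unstratified split.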

## 3. Baseline

In [None]:
if df is not None:
    # Numeric columns are imputed and scaled; all other columns are one-hot encoded.
    num_cols = X_train.select_dtypes(include='number').columns.tolist()
    cat_cols = [c for c in X_train.columns if c not in num_cols]
    pre = ColumnTransformer([
        ('num', Pipeline([('imp', SimpleImputer(strategy='median')), ('sc', StandardScaler())]), num_cols),
        ('cat', Pipeline([('imp', SimpleImputer(strategy='most_frequent')), ('oh', OneHotEncoder(handle_unknown='ignore'))]), cat_cols)
    ])
    clf = Pipeline([('pre', pre), ('lr', LogisticRegression(max_iter=200))])
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_valid)
    is_binary = y_valid.nunique() == 2
    metrics = {
        'F1': float(f1_score(y_valid, y_pred, average='binary' if is_binary else 'macro')),
        'Accuracy': float(accuracy_score(y_valid, y_pred)),
    }
    if is_binary:
        # Ranking metrics need the predicted probability of the positive class (column 1).
        p_valid = clf.predict_proba(X_valid)[:, 1]
        metrics['AUROC'] = float(roc_auc_score(y_valid, p_valid))
        metrics['AUPRC'] = float(average_precision_score(y_valid, p_valid))
    # A bare expression inside an `if` block is not displayed by the notebook, so print it.
    print(metrics)
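
To put the baseline numbers in context, it helps to know what an uninformative model scores. The sketch below, on made-up data (`X_toy` and `y_toy` are hypothetical), fits scikit-learn's `DummyClassifier` with `strategy='prior'`, which predicts the class prior for every row. Its AUROC is 0.5 and its AUPRC equals the positive rate, so the logistic regression should be judged against those floors rather than against zero.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical toy data: 200 rows, 10% positives, features are pure noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 20 + [0] * 180)

# 'prior' ignores the features and outputs the training class frequencies.
dummy = DummyClassifier(strategy='prior').fit(X_toy, y_toy)
p_dummy = dummy.predict_proba(X_toy)[:, 1]

floor = {
    'AUROC': float(roc_auc_score(y_toy, p_dummy)),
    'AUPRC': float(average_precision_score(y_toy, p_dummy)),
    'prevalence': float(y_toy.mean()),
}
print(floor)
```

In the notebook, the same comparison would fit the dummy on `X_train`, `y_train` and score it on `X_valid`, `y_valid`.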

## 4. Error Analysis & Subgroups
Add a few cohort slices and discuss errors.
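
One way to build cohort slices is to group the validation predictions by a cohort column and compute the same statistics per group. The sketch below is an illustration under assumptions: `results`, the `age_band` column, and the helper `slice_report` are hypothetical names, and the values are made up.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical validation results; 'age_band' is an assumed cohort column.
results = pd.DataFrame({
    'age_band': ['<50', '<50', '<50', '50+', '50+', '50+', '50+', '50+'],
    'y_true':   [1, 0, 0, 1, 1, 0, 0, 1],
    'y_pred':   [1, 0, 1, 0, 1, 0, 0, 1],
})

def slice_report(frame, by):
    """Per-cohort size, prevalence, recall, and error rate."""
    rows = []
    for name, g in frame.groupby(by):
        rows.append({
            by: name,
            'n': len(g),
            'prevalence': g['y_true'].mean(),
            # zero_division=0 avoids a warning when a slice has no positives.
            'recall': recall_score(g['y_true'], g['y_pred'], zero_division=0),
            'error_rate': (g['y_true'] != g['y_pred']).mean(),
        })
    return pd.DataFrame(rows).set_index(by)

report = slice_report(results, 'age_band')
print(report)
```

Small slices give noisy estimates, so report `n` alongside every per-cohort metric and treat differences between small groups with caution.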

## 5. Limitations & Next Steps
- Data quality, leakage checks
- Try stronger models (XGBoost/LightGBM)
- Calibrate probabilities and choose decision thresholds to match clinical need
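
As a starting point for the threshold item, the sketch below picks the highest probability cutoff that still reaches a required recall, using `precision_recall_curve` on validation scores. The arrays, the helper `threshold_for_recall`, and the 0.75 recall target are illustrative assumptions, not clinical recommendations.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical validation labels and predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p_hat = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9])

def threshold_for_recall(y, p, min_recall):
    """Return the highest threshold whose recall is at least min_recall."""
    precision, recall, thresholds = precision_recall_curve(y, p)
    # precision and recall have one more entry than thresholds;
    # the final point (recall 0) has no threshold, so drop it.
    ok = recall[:-1] >= min_recall
    return float(thresholds[ok].max())

cutoff = threshold_for_recall(y_true, p_hat, min_recall=0.75)
y_flag = (p_hat >= cutoff).astype(int)  # patients flagged at this cutoff
print(cutoff, y_flag)
```

Choose the cutoff on the validation split only and report the resulting precision and recall on the test split, so the chosen threshold is not tuned to the data it is evaluated on.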