# Things to do
- EDA: inspect columns, handle missing data, define target
- Build baseline Logistic Regression model
- Add Decision Tree + Random Forest, compare scores
- Model selection + tuning (max_depth, n_estimators, etc.)
- Train final model & save with pickle
- Implement FastAPI service for inference
- Dockerize the app (Dockerfile)
- Test locally + via curl or Postman
- Optional cloud deploy (Render / Railway / Fly.io)

In [69]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

from sklearn.linear_model import LogisticRegression

In [70]:
# Load the raw dataset
df = pd.read_csv('../data/diabetic_data.csv')

## Exploratory Data Analysis
- Inspect columns
- Handle missing data

### Initial thoughts
- Missing values coded as '?': check the column data types, then fill with 0, or drop if many values are missing
- Missing values stored as NaN: check the column data types, then fill with 'NA', or drop if many values are missing
- `readmitted` is the target variable; maybe convert it to binary (0 and 1)
- Are any visualizations needed?
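
To decide between filling and dropping, it helps to look at the share of missing values per column first. A small sketch on a toy frame (the values and the `DROP_THRESHOLD` cutoff are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for diabetic_data.csv (hypothetical values).
toy = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "time_in_hospital": [1, 3, 2, 2],
})

# Treat '?' as missing, then compute the share of missing values per column.
missing_share = (
    toy.replace("?", np.nan).isna().mean().sort_values(ascending=False)
)

# Hypothetical cutoff: columns above it are candidates for dropping,
# columns below it are candidates for filling.
DROP_THRESHOLD = 0.5
drop_candidates = missing_share[missing_share > DROP_THRESHOLD].index.tolist()

print(missing_share)
print("drop candidates:", drop_candidates)
```

On the real data the same two lines would run on `df` directly; the threshold is a judgment call.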

In [72]:
# Define the binary target from `readmitted`:
# 'NO'  → not readmitted            → 0
# '>30' → readmitted after 30 days  → 0
# '<30' → readmitted within 30 days → 1

df['readmitted'] = (df['readmitted'] == '<30').astype(int)
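
Because both 'NO' and '>30' fall into class 0, the positive class can end up much smaller than the negative one, so it is worth checking the balance before modeling. A quick sketch on made-up labels (the `raw` values are hypothetical):

```python
import pandas as pd

# Hypothetical raw labels, using the three categories described above.
raw = pd.Series(
    ["NO", ">30", "<30", "NO", ">30", "NO", "<30", "NO"], name="readmitted"
)

# Same binarization as in the notebook: only '<30' counts as positive.
target = (raw == "<30").astype(int)

# Share of positives and absolute counts per class.
positive_rate = target.mean()
counts = target.value_counts().to_dict()

print(f"positive rate: {positive_rate:.2f}")
print(counts)
```

If the real positive rate turns out to be low, accuracy alone would be a weak score for comparing models.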

In [73]:
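# Treat the '?' placeholder as a missing value
df = df.replace('?', np.nan)

# Split columns by dtype (the target is already numeric at this point)
cat_cols = df.select_dtypes(include=['object']).columns
num_cols = df.select_dtypes(include=['number']).columns

# Normalize categorical values: lowercase, spaces replaced by underscores
for c in cat_cols:
    df[c] = df[c].str.lower().str.replace(' ', '_')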

In [74]:

# Fill remaining gaps: 'NA' for categorical columns, 0 for numeric ones
df[cat_cols] = df[cat_cols].fillna('NA')
df[num_cols] = df[num_cols].fillna(0.0)

In [76]:
# Map each 10-year age bracket to its midpoint
age_map = {
    '[0-10)': 5,
    '[10-20)': 15,
    '[20-30)': 25,
    '[30-40)': 35,
    '[40-50)': 45,
    '[50-60)': 55,
    '[60-70)': 65,
    '[70-80)': 75,
    '[80-90)': 85,
    '[90-100)': 95
}
df['age'] = df['age'].map(age_map)

In [77]:
df['age'].unique()

array([ 5, 15, 25, 35, 45, 55, 65, 75, 85, 95])
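
One thing to keep in mind: `Series.map` with a dict returns NaN for any value that is not a key, so an unexpected bracket label would silently become missing. A small guard, sketched on toy input (the `"unknown"` label is a hypothetical example of an unseen value):

```python
import pandas as pd

# Same bracket-to-midpoint mapping as above, built programmatically.
age_map = {f"[{lo}-{lo + 10})": lo + 5 for lo in range(0, 100, 10)}

# Toy input with one label that is not in the mapping (hypothetical).
ages = pd.Series(["[0-10)", "[70-80)", "[90-100)", "unknown"])

mapped = ages.map(age_map)

# Collect the raw labels that failed to map, so they can be handled explicitly.
unmapped = ages[mapped.isna()].unique().tolist()
print("unmapped labels:", unmapped)
```

This matters most at inference time, where the service may receive age values that never appeared in the training data.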

In [78]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| encounter_id | 2278392 | 149190 | 64410 | 500364 | 16680 |
| patient_nbr | 8222157 | 55629189 | 86047875 | 82442376 | 42519267 |
| race | caucasian | caucasian | africanamerican | caucasian | caucasian |
| gender | female | female | female | male | male |
| age | 5 | 15 | 25 | 35 | 45 |
| weight | | | | | |
| admission_type_id | 6 | 1 | 1 | 1 | 1 |
| discharge_disposition_id | 25 | 1 | 1 | 1 | 1 |
| admission_source_id | 1 | 7 | 7 | 7 | 7 |
| time_in_hospital | 1 | 3 | 2 | 2 | 1 |
