## Load cleaned data 

### Imports and reproducibility

Here we bring in the basic tools we need.  

- `os` and `sqlite3` help us work with files and a small database  
- `numpy` and `pandas` help us handle numbers and tables  
- `matplotlib.pyplot` lets us make simple plots  
- `sklearn` is the machine learning toolbox with models and metrics  

We also set something called **reproducibility**.  
That means if we run the notebook today or tomorrow we always get the same result.  

Why do we need this?  
Some models in machine learning pick random numbers inside.  
If we do not control the random numbers we may get slightly different results each run.  
That makes it hard to compare experiments.  

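The line `RANDOM_STATE = 42` stores a seed value for the random number generator.  
We pass it to `np.random.seed` and to the `random_state` argument of the scikit-learn functions and models, so each time the random choices are made they follow the same path.  
This makes our experiments repeatable and fair.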
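
A minimal sketch of what a seed does. It uses NumPy's `default_rng` generator instead of the notebook's `np.random.seed` call, and the variable names are only for illustration. Two generators built from the same seed give the same numbers.

```python
import numpy as np

RANDOM_STATE = 42

# Two generators built from the same seed follow the same path
rng_first = np.random.default_rng(RANDOM_STATE)
rng_second = np.random.default_rng(RANDOM_STATE)
draws_first = rng_first.integers(0, 100, size=5)
draws_second = rng_second.integers(0, 100, size=5)

# A generator with a different seed is free to follow a different path
rng_other = np.random.default_rng(RANDOM_STATE + 1)
draws_other = rng_other.integers(0, 100, size=5)

same_seed_matches = bool(np.array_equal(draws_first, draws_second))
print("same seed, same numbers:", same_seed_matches)
print("draws with another seed:", draws_other)
```

The scikit-learn `random_state` argument works on the same idea: the same value gives the same split or the same trees on every run.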

In [11]:
# basic imports
import os
import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# sklearn imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, classification_report, confusion_matrix, RocCurveDisplay,
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

### Load the cleaned data

We start by loading the cleaned dataset from a CSV file.  
CSV is just a simple text file with rows and columns like a spreadsheet.  

Then we also save this same data into a small database called **SQLite**.  
Why do we do this?  
- A database makes it easy to query or filter later  
- It keeps the same workflow as in preprocessing  
- It shows you how data can live in different formats (CSV and database)  

After saving we read the data back from the database into a pandas DataFrame.  
This way we check that everything works and that the data looks the same.  

In the end we print the shape of the table (rows × columns) and show the first few lines.
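
A small sketch of the same round trip, assuming a made-up three-row table and an in-memory SQLite database so no file is touched. The table name `demo_table` and the rows are hypothetical. It also shows one filter query, which is the kind of thing the database makes easy.

```python
import sqlite3
import pandas as pd

# Tiny illustrative table (made-up rows, not the real dataset)
df_small = pd.DataFrame(
    {
        "age": [80.0, 54.0, 28.0],
        "bmi": [25.19, 27.32, 27.32],
        "diabetes": [0, 0, 1],
    }
)

# An in-memory database avoids touching any file on disk
conn = sqlite3.connect(":memory:")
try:
    df_small.to_sql("demo_table", conn, if_exists="replace", index=False)
    df_back = pd.read_sql("SELECT * FROM demo_table", conn)
    # The filter runs inside the database; the ? placeholder passes the value safely
    df_older = pd.read_sql("SELECT * FROM demo_table WHERE age > ?", conn, params=(50,))
finally:
    conn.close()

# Compare values and column order; dtypes may differ slightly after a round trip
pd.testing.assert_frame_equal(df_small, df_back, check_dtype=False)
print(df_back.shape, len(df_older))
```

The `try` / `finally` block makes sure the connection is closed even if one of the steps fails.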

In [14]:
csv_path = '../data/processed/diabetes_clean.csv'
db_path = '../data/processed/diabetes.db'
table_name = 'diabetes_clean'

# read csv
df_csv = pd.read_csv(csv_path)

# write to sqlite (sqlite3 is already imported at the top)
conn = sqlite3.connect(db_path)
df_csv.to_sql(table_name, conn, if_exists='replace', index=False)

# read back from sqlite
df = pd.read_sql(f'SELECT * FROM {table_name}', conn)
conn.close()

print('Data shape:', df.shape)
df.head()


Data shape: (99991, 13)


Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes,age_group,bmi_category,risk_score,age_bmi_interaction
0,Female,80.0,0,1,never,25.19,6.6,140,0,senior,overweight,1,2015.2
1,Female,54.0,0,0,unknown,27.32,6.6,80,0,senior,overweight,0,1475.28
2,Male,28.0,0,0,never,27.32,5.7,158,0,young,overweight,0,764.96
3,Female,36.0,0,0,current,23.45,5.0,155,0,middle,normal,1,844.2
4,Male,76.0,1,1,current,20.14,4.8,155,0,senior,normal,3,1530.64


### Conclusion
We imported all the tools we need for machine learning and fixed the random seed so results will always be the same each run  
This gives us reproducibility  

We loaded the cleaned dataset from a CSV file and also stored it in a small SQLite database  

With this we are sure the data is ready and stable for the next steps


## Define target and features

We now set the target column to be **`diabetes`**  
This is the column we want the model to predict (0 = no diabetes, 1 = diabetes)  

We call this column `y`  
All the other columns are called `X` → these are the features the model will use to make a prediction  

So in short  
- `y` = the answer we want to predict  
- `X` = the clues the model can look at


In [12]:
target_col = 'diabetes'
y = df[target_col]
X = df.drop(columns=[target_col])
print("Target column:", target_col)


Target column: diabetes


**Conclusion**  
We clearly told the notebook that our target is `diabetes`  
Now we have a clean split between features (`X`) and target (`y`)  
This makes it ready for the next steps like checking leakage and training models


## Feature leakage check

### Build two feature sets to check leakage

The biggest challenge so far is deciding whether certain features could cause data leakage. 

For example, insulin levels are often only available during or after a diagnostic process, which makes them less realistic for prediction. They were already removed, so this dataset has no insulin column.

Glucose is another concern because it is very highly correlated with the target, and might make the model perform unrealistically well.

On the one hand, glucose is the most important clinical indicator for diabetes. Physicians base the diagnosis primarily on blood glucose levels (for example, fasting glucose or HbA1c).

On the other hand, glucose can also be measured as part of a standard blood test, so it is not only available when someone is already suspected of having diabetes.
I’m going to test the models with and without it.

So we make two versions of our dataset  
- Variant A keeps all numeric features (including `blood_glucose_level`)  
- Variant B removes `blood_glucose_level` so we can test if results are more realistic
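
One rough way to spot features worth a second look is to check how strongly each one correlates with the target. The sketch below uses synthetic data with hypothetical column names (`leaky_feature`, `ordinary_feature`), where one feature is built from the target on purpose. A high correlation is only a hint and not proof of leakage; whether a feature leaks depends on when it becomes available in practice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic data: the target is random, one feature is built from it on purpose
target = rng.integers(0, 2, size=n_rows)
demo = pd.DataFrame(
    {
        "leaky_feature": 100 + 50 * target + rng.normal(0, 5, size=n_rows),
        "ordinary_feature": rng.normal(0, 1, size=n_rows),
        "target": target,
    }
)

# Absolute Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    demo.drop(columns=["target"])
    .corrwith(demo["target"])
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)

most_suspicious = corr_with_target.index[0]
```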

In [17]:
# split features and target
y = df[target_col]
X = df.drop(columns=[target_col])

# keep only numeric for a simple baseline
X_num = X.select_dtypes(include=[np.number]).copy()

# Variant A: all numeric features (possible leakage included)
X_A = X_num.copy()

# Variant B: remove only blood_glucose_level to reduce leakage
X_B = X_num.drop(columns=['blood_glucose_level'], errors='ignore').copy()

print('Variant A shape', X_A.shape, 'Variant B shape', X_B.shape)


Variant A shape (99991, 8) Variant B shape (99991, 7)


### Conclusion
Variant A uses all numeric features while Variant B excludes `blood_glucose_level`  
This way we can later compare results and see how much impact this feature has on the model performance


## Train-test split
### Train-test split for both variants


We split the data into training and testing sets  
- Training data is used to teach the model  
- Testing data is kept aside to check how well the model performs on unseen data  

We do this for both Variant A (with blood_glucose_level) and Variant B (without it)  

We use  
- `test_size=0.2` → 20 percent test, 80 percent train  
- `random_state` → for reproducibility  
- `stratify=y` → to keep the same balance of 0 and 1 in train and test
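
A short sketch of what `stratify` does, assuming synthetic labels with 10 percent positives (the names `X_demo` and `y_demo` are only for illustration). After the split, train and test keep the same share of positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 100 positives out of 1000 rows (10 percent)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positives in each part
train_rate = y_tr.mean()
test_rate = y_te.mean()
print("positive rate in train:", train_rate, "in test:", test_rate)
```

Without `stratify`, the share of positives in the test set can drift away from the share in the full data, which matters most when one class is rare.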


In [18]:
# split data for Variant A
X_train_A, X_test_A, y_train_A, y_test_A = train_test_split(
    X_A, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

# split data for Variant B
X_train_B, X_test_B, y_train_B, y_test_B = train_test_split(
    X_B, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print("Variant A - train:", X_train_A.shape, "test:", X_test_A.shape)
print("Variant B - train:", X_train_B.shape, "test:", X_test_B.shape)


Variant A - train: (79992, 8) test: (19999, 8)
Variant B - train: (79992, 7) test: (19999, 7)


**Conclusion**  
We now have two clean splits  
- Variant A with all features  
- Variant B without blood_glucose_level  

Both have 80 percent of the rows for training and 20 percent for testing  
This makes the setup ready for model training


## Model training Variant A (possible leakage)


We start with very simple models so we keep things clear  
We fill empty values with the median of the training data  
We scale the features only for logistic regression  
We train three models:
- logistic regression  
- decision tree  
- random forest  

Then we check simple scores:  
accuracy, precision, recall, f1 and roc auc  
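
To see how these scores relate to each other, here is a worked example with hypothetical counts for 1000 patients. The numbers are made up for illustration and are not results from this notebook. ROC AUC is left out because it is computed from predicted probabilities, not from these four counts.

```python
# Hypothetical confusion-matrix counts for 1000 patients, 100 of whom have diabetes
true_positives = 44    # sick and predicted sick
false_negatives = 56   # sick but predicted healthy
false_positives = 1    # healthy but predicted sick
true_negatives = 899   # healthy and predicted healthy

total = true_positives + false_negatives + false_positives + true_negatives

# Share of all predictions that are correct
accuracy = (true_positives + true_negatives) / total
# Share of predicted-sick patients who really are sick
precision = true_positives / (true_positives + false_positives)
# Share of sick patients the model actually finds
recall = true_positives / (true_positives + false_negatives)
# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With a rare positive class, accuracy can stay high even when more than half of the sick patients are missed, which is why recall and f1 are worth checking as well.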

In [19]:
# fill missing values with median
X_train_A_imputed = X_train_A.fillna(X_train_A.median(numeric_only=True))
X_test_A_imputed = X_test_A.fillna(X_train_A.median(numeric_only=True))

# scale only for logistic regression
from sklearn.preprocessing import StandardScaler
scaler_A = StandardScaler()
X_train_A_scaled = scaler_A.fit_transform(X_train_A_imputed)
X_test_A_scaled = scaler_A.transform(X_test_A_imputed)

# make models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

log_reg_A = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)
tree_A = DecisionTreeClassifier(random_state=RANDOM_STATE)
rf_A = RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE)

# fit models
log_reg_A.fit(X_train_A_scaled, y_train_A)
tree_A.fit(X_train_A_imputed, y_train_A)
rf_A.fit(X_train_A_imputed, y_train_A)

# helper to compute simple metrics (the metric functions are imported at the top)
def eval_model(model, X_test, y_test):
    """Return accuracy, precision, recall, f1 and roc_auc for a fitted classifier."""
    y_pred = model.predict(X_test)
    # roc_auc needs a score for the positive class, if the model can provide one
    score = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, score) if score is not None else None
    }

# evaluate
metrics_log_A = eval_model(log_reg_A, X_test_A_scaled, y_test_A)
metrics_tree_A = eval_model(tree_A, X_test_A_imputed, y_test_A)
metrics_rf_A = eval_model(rf_A, X_test_A_imputed, y_test_A)

print("Logistic A", metrics_log_A)
print("Tree A", metrics_tree_A)
print("RF A", metrics_rf_A)


Logistic A {'accuracy': 0.9573978698934946, 'precision': 0.9820627802690582, 'recall': 0.43887775551102204, 'f1': 0.6066481994459834, 'roc_auc': np.float64(0.9583984926578556)}
Tree A {'accuracy': 0.9304965248262413, 'precision': 0.5342729019859065, 'recall': 0.5571142284569138, 'f1': 0.5454545454545454, 'roc_auc': np.float64(0.76118727564306)}
RF A {'accuracy': 0.9526976348817441, 'precision': 0.8184971098265896, 'recall': 0.4729458917835671, 'f1': 0.5994919559695173, 'roc_auc': np.float64(0.9406687478657817)}


## Model training Variant B (without possible leakage)

Now we train the same three models on Variant B  
Variant B does not include the column blood_glucose_level  
This gives us a more honest test because we removed the possible leakage  

We again  
- fill missing values with the median  
- scale only for logistic regression  
- train logistic regression, decision tree and random forest  

Then we check the same scores:  
accuracy, precision, recall, f1 and roc auc
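
The fill, scale and train steps can also be bundled into one scikit-learn `Pipeline`. This is a sketch on synthetic data and not the approach used in the cells of this notebook; the step names `imputer`, `scaler` and `model` are my own choice. The benefit is that the median and the scaling are learned from the training rows only and then reused on the test rows automatically.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data with two features; the label depends on the first feature
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo[::10, 1] = np.nan  # knock out some values to imitate missing data

X_train_demo, X_test_demo = X_demo[:150], X_demo[150:]
y_train_demo, y_test_demo = y_demo[:150], y_demo[150:]

# Each step is fitted on the training rows and then applied to the test rows
pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000, random_state=42)),
    ]
)
pipe.fit(X_train_demo, y_train_demo)

learned_medians = pipe.named_steps["imputer"].statistics_
test_accuracy = pipe.score(X_test_demo, y_test_demo)
print("medians learned from train:", learned_medians)
print("test accuracy:", test_accuracy)
```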


In [20]:
# fill missing values with median
X_train_B_imputed = X_train_B.fillna(X_train_B.median(numeric_only=True))
X_test_B_imputed = X_test_B.fillna(X_train_B.median(numeric_only=True))

# scale only for logistic regression
scaler_B = StandardScaler()
X_train_B_scaled = scaler_B.fit_transform(X_train_B_imputed)
X_test_B_scaled = scaler_B.transform(X_test_B_imputed)

# make models
log_reg_B = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)
tree_B = DecisionTreeClassifier(random_state=RANDOM_STATE)
rf_B = RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE)

# fit models
log_reg_B.fit(X_train_B_scaled, y_train_B)
tree_B.fit(X_train_B_imputed, y_train_B)
rf_B.fit(X_train_B_imputed, y_train_B)

# evaluate
metrics_log_B = eval_model(log_reg_B, X_test_B_scaled, y_test_B)
metrics_tree_B = eval_model(tree_B, X_test_B_imputed, y_test_B)
metrics_rf_B = eval_model(rf_B, X_test_B_imputed, y_test_B)

print("Logistic B", metrics_log_B)
print("Tree B", metrics_tree_B)
print("RF B", metrics_rf_B)


Logistic B {'accuracy': 0.9573978698934946, 'precision': 0.9835082458770614, 'recall': 0.43820975283901137, 'f1': 0.6062846580406654, 'roc_auc': np.float64(0.9583931311619744)}
Tree B {'accuracy': 0.9363968198409921, 'precision': 0.5791695988740324, 'recall': 0.5497661990647963, 'f1': 0.5640849897189856, 'roc_auc': np.float64(0.7782430244411281)}
RF B {'accuracy': 0.9505475273763688, 'precision': 0.7679324894514767, 'recall': 0.4863059452237809, 'f1': 0.5955010224948876, 'roc_auc': np.float64(0.9316041371829523)}


## Compare results of Variant A and Variant B

| Model              | Variant | Accuracy | Precision | Recall | F1   | ROC-AUC |
|--------------------|---------|----------|-----------|--------|------|---------|
| Logistic Regression | A       | 0.9574   | 0.9821    | 0.4389 | 0.6066 | 0.9584 |
| Decision Tree       | A       | 0.9305   | 0.5343    | 0.5571 | 0.5455 | 0.7612 |
| Random Forest       | A       | 0.9527   | 0.8185    | 0.4729 | 0.5995 | 0.9407 |
| Logistic Regression | B       | 0.9574   | 0.9835    | 0.4382 | 0.6063 | 0.9584 |
| Decision Tree       | B       | 0.9364   | 0.5792    | 0.5498 | 0.5641 | 0.7782 |
| Random Forest       | B       | 0.9505   | 0.7679    | 0.4863 | 0.5955 | 0.9316 |
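
A table like the one above can also be built with pandas instead of by hand. The sketch below types in the rounded values from the printed metrics so it runs on its own; in the notebook the rows could come from the `metrics_*` dictionaries instead.

```python
import pandas as pd

# Rounded values copied from the printed metrics of both variants
rows = [
    {"model": "Logistic Regression", "variant": "A", "accuracy": 0.9574, "precision": 0.9821, "recall": 0.4389, "f1": 0.6066, "roc_auc": 0.9584},
    {"model": "Decision Tree", "variant": "A", "accuracy": 0.9305, "precision": 0.5343, "recall": 0.5571, "f1": 0.5455, "roc_auc": 0.7612},
    {"model": "Random Forest", "variant": "A", "accuracy": 0.9527, "precision": 0.8185, "recall": 0.4729, "f1": 0.5995, "roc_auc": 0.9407},
    {"model": "Logistic Regression", "variant": "B", "accuracy": 0.9574, "precision": 0.9835, "recall": 0.4382, "f1": 0.6063, "roc_auc": 0.9584},
    {"model": "Decision Tree", "variant": "B", "accuracy": 0.9364, "precision": 0.5792, "recall": 0.5498, "f1": 0.5641, "roc_auc": 0.7782},
    {"model": "Random Forest", "variant": "B", "accuracy": 0.9505, "precision": 0.7679, "recall": 0.4863, "f1": 0.5955, "roc_auc": 0.9316},
]

comparison = pd.DataFrame(rows).set_index(["model", "variant"])
print(comparison)

# Which row wins on each metric
best_per_metric = comparison.idxmax()
best_recall = comparison["recall"].idxmax()
print(best_per_metric)
```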

**Conclusion**  
- Logistic Regression performs almost the same in both Variant A and B  
  → very high accuracy and precision, but recall is low (it misses many true cases of diabetes)  
- Decision Tree improves slightly in Variant B (better precision and f1, higher roc-auc)  
- Random Forest is strong in both, with very similar results  
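- Removing `blood_glucose_level` did **not drastically change** the performance, so the models are not fully dependent on that single feature  
  Note that `hba1c_level`, another diagnostic marker, is still part of Variant B, which may explain part of this  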

In simple words  
The models do not collapse without blood_glucose_level  
This means they can still learn useful patterns from other features  
But recall is low for Logistic Regression and Random Forest → they catch fewer true positives  
Decision Tree balances precision and recall better but with lower overall accuracy  

### Conclusion
Looking at the results  

- **Logistic Regression** has the highest accuracy and precision  
  But recall is low → it misses many people who actually have diabetes  
- **Decision Tree** has the best balance between precision and recall  
  But overall accuracy is lower  
- **Random Forest** sits in the middle → good accuracy, better recall than Logistic Regression, and solid roc-auc  

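If the goal is to **catch as many true diabetes cases as possible** (high recall), then **Decision Tree** has the highest recall (about 0.55), followed by **Random Forest** (about 0.47 to 0.49)  
If the goal is to **avoid false alarms** (high precision), then **Logistic Regression** is a strong choice  
For a balanced trade-off, **Random Forest** is the safest pick: its recall is better than Logistic Regression and its roc-auc is far above the Decision Tree  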
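
Recall also depends on the decision threshold. By default a model predicts 1 when the probability is above 0.5. The sketch below, on synthetic imbalanced data and not on the diabetes dataset, shows that a lower threshold such as 0.3 (an arbitrary example value) finds at least as many true cases, usually at the cost of precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10 percent positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=4, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)
proba_positive = model.predict_proba(X_te)[:, 1]

# Turn probabilities into labels with two different thresholds
recalls = {}
for threshold in (0.5, 0.3):
    y_pred = (proba_positive >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_pred, zero_division=0)
    precision = precision_score(y_te, y_pred, zero_division=0)
    print(f"threshold={threshold}: precision={precision:.3f} recall={recalls[threshold]:.3f}")
```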

### Final Choice 
We will continue with **Random Forest on Variant B**  
This model gives the most reliable results  
It combines good accuracy with better recall than Logistic Regression  
and it does not depend too much on blood_glucose_level  
This makes it a fair and realistic choice for our final pipeline


## Best model + pipeline

### Finalize and save the Variant B pipeline

We finish with Random Forest on Variant B  
We fill missing values with the median  
We train on the full Variant B dataset so the model sees all rows  
We save the model with joblib together with the feature list  
We also make a simple feature importance chart so we see which clues matter most
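
A sketch of one way to make the saved file usable on new rows: store the median values themselves next to the model, then reload and apply them. It uses synthetic data, a temporary folder, and hypothetical names (`demo_pipeline.pkl`, the `medians` key), so it is an illustration and not the notebook's final pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for the Variant B features (column names are illustrative)
X_demo = pd.DataFrame({"age": rng.uniform(20, 80, 100), "bmi": rng.uniform(18, 40, 100)})
y_demo = (X_demo["bmi"] > 30).astype(int)
X_demo.loc[::10, "age"] = np.nan

medians = X_demo.median(numeric_only=True)
model = RandomForestClassifier(n_estimators=20, random_state=42)
model.fit(X_demo.fillna(medians), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    # Store the medians themselves so new rows can be filled the same way
    joblib.dump(
        {"model": model, "features": X_demo.columns.tolist(), "medians": medians.to_dict()},
        bundle_path,
    )
    bundle = joblib.load(bundle_path)

# Apply the saved steps to a new row with a missing value
new_row = pd.DataFrame({"age": [np.nan], "bmi": [35.0]})
new_row = new_row[bundle["features"]].fillna(bundle["medians"])
prediction = bundle["model"].predict(new_row)
print("prediction for new row:", prediction)
```

Keeping the feature list in the file makes it possible to put the columns of new data in the same order the model was trained on.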


In [21]:
# prepare full Variant B data
X_full_B = X_B.fillna(X_B.median(numeric_only=True))
y_full = y

# train final random forest on all Variant B rows
rf_final = RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE)
rf_final.fit(X_full_B, y_full)

# save pipeline parts next to the cleaned data (this folder already holds the CSV)
import joblib

model_path = '../data/processed/diabetes_clf_pipeline.pkl'
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # make sure the folder exists
joblib.dump(
    {
        "model": rf_final,
        "features": X_B.columns.tolist(),
        "imputer": "median by column",
        "variant": "B without blood_glucose_level"
    },
    model_path
)

print("saved to", model_path)

# quick feature importance visual (plt, np and pd are imported at the top)
importances = pd.Series(rf_final.feature_importances_, index=X_B.columns).sort_values(ascending=True)
topk = min(10, len(importances))
imp_top = importances.tail(topk)

plt.figure()
imp_top.plot(kind="barh")
plt.title("feature importance random forest variant B")
plt.xlabel("importance")
plt.ylabel("feature")
plt.tight_layout()
plt.show()

## Save pipeline