# Predicting Diabetes Type (Type 1 vs Type 2) — Synthetic Dataset (100 Patients)

**Goal:** Build a simple, end-to-end pipeline to predict whether a person is **Type 1** or **Type 2** diabetic using demographic, clinical, and lab features.

> ⚠️ This is a **synthetic, educational** dataset — not for medical use. The features and labels are simulated with plausible patterns and noise.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, r2_score, mean_absolute_error

pd.set_option('display.max_columns', None)

## Load Data

In [None]:
df = pd.read_csv(r"/mnt/data/diabetes_type_prediction_100.csv")
df.head()

## Structure & Missing Values

In [None]:
df.info()

In [None]:
df.isna().sum()

No missing values are expected, but if any appear, we would impute:
- **Numeric**: median
- **Categorical**: 'Unknown'
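
The imputation plan above could be wired into the preprocessing pipelines roughly as follows. This is a minimal sketch on a toy frame: the values and the names `toy`, `numeric_with_impute`, `categorical_with_impute`, and `preprocess_with_impute` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one missing numeric and one missing categorical value
# (made-up values, not taken from the dataset).
toy = pd.DataFrame({
    'BMI': [22.0, np.nan, 31.0, 27.0],
    'Smoker': ['No', 'Yes', np.nan, 'No'],
})

# Numeric columns: fill gaps with the median, then scale.
numeric_with_impute = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Categorical columns: fill gaps with the literal 'Unknown', then one-hot encode.
categorical_with_impute = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

preprocess_with_impute = ColumnTransformer(transformers=[
    ('num', numeric_with_impute, ['BMI']),
    ('cat', categorical_with_impute, ['Smoker']),
])

toy_transformed = preprocess_with_impute.fit_transform(toy)
print(toy_transformed.shape)
```

Keeping the imputers inside the pipeline means the medians are learned from the training rows only, so nothing leaks from the test set.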

## Target Distribution

In [None]:
counts = df['DiabetesType'].value_counts().sort_index()
plt.figure()
counts.plot(kind='bar')
plt.title('Distribution of Diabetes Type')
plt.xlabel('Type')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

counts

## Feature Engineering (optional)

In [None]:
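# Pulse pressure: systolic minus diastolic blood pressure.
df['PulsePressure'] = df['SystolicBP'] - df['DiastolicBP']

def bmi_class(x):
    """Map a BMI value to its standard WHO weight category."""
    if x < 18.5:
        return 'Underweight'
    if x < 25:
        return 'Normal'
    if x < 30:
        return 'Overweight'
    return 'Obese'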

df['BMI_Class'] = df['BMI'].apply(bmi_class)
df[['BMI','BMI_Class','PulsePressure']].head()

## Train/Test Split & Preprocessing

In [None]:
target = 'DiabetesType'
X = df.drop(columns=[target])
y = df[target].astype(int)

numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(exclude=[np.number]).columns.tolist()

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore'))])

preprocess = ColumnTransformer(
    transformers=[('num', numeric_transformer, numeric_features),
                  ('cat', categorical_transformer, categorical_features)]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

len(X_train), len(X_test)

## Model 1 — Simple Linear Regression (with Thresholding)

We first fit a **Linear Regression** model (as requested).
It predicts a continuous value; since the labels are coded 1 and 2, we convert it to a class at the midpoint, a **1.5 threshold**:
- Predicted Type = 1 if prediction < **1.5**, else 2.


In [None]:
linreg = Pipeline(steps=[('prep', preprocess),
                         ('lr', LinearRegression())])
linreg.fit(X_train, y_train)

y_pred_cont = linreg.predict(X_test)
y_pred_cls = np.where(y_pred_cont < 1.5, 1, 2)

print('Linear Regression evaluation (continuous):')
print('  R^2:', round(r2_score(y_test, y_pred_cont), 4))
print('  MAE:', round(mean_absolute_error(y_test, y_pred_cont), 4))

print('\nLinear Regression (thresholded to classes):')
print('  Accuracy:', round(accuracy_score(y_test, y_pred_cls), 4))
print('  Confusion Matrix:\n', confusion_matrix(y_test, y_pred_cls))
print('\nClassification Report:\n', classification_report(y_test, y_pred_cls, digits=4))

## Model 2 — Logistic Regression (Baseline Classifier)

**Logistic Regression** is a more appropriate baseline for a binary outcome.
We fit it and compare metrics with the thresholded linear regression above.


In [None]:
logreg = Pipeline(steps=[('prep', preprocess),
                         ('clf', LogisticRegression(max_iter=1000))])
logreg.fit(X_train, y_train)

y_pred_log = logreg.predict(X_test)

print('Logistic Regression:')
print('  Accuracy:', round(accuracy_score(y_test, y_pred_log), 4))
print('  Confusion Matrix:\n', confusion_matrix(y_test, y_pred_log))
print('\nClassification Report:\n', classification_report(y_test, y_pred_log, digits=4))
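
To make the comparison between the two models easier to read, the metrics can be collected into one small table. The sketch below is one possible way to do that; `compare_models` is a hypothetical helper, and the labels and predictions are made-up values for illustration only.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def compare_models(y_true, predictions):
    """Return one row of metrics per model; `predictions` maps name -> predicted labels."""
    rows = {
        name: {
            'accuracy': accuracy_score(y_true, y_pred),
            'macro_f1': f1_score(y_true, y_pred, average='macro'),
        }
        for name, y_pred in predictions.items()
    }
    return pd.DataFrame.from_dict(rows, orient='index')

# Hypothetical labels and predictions (not results from this notebook).
y_true_demo = [1, 1, 2, 2, 2, 1]
demo_predictions = {
    'linear_thresholded': [1, 2, 2, 2, 2, 1],
    'logistic': [1, 1, 2, 2, 2, 1],
}

comparison = compare_models(y_true_demo, demo_predictions)
print(comparison)
```

In this notebook the call would presumably be `compare_models(y_test, {'linear_thresholded': y_pred_cls, 'logistic': y_pred_log})`.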

## How to Predict for a New Person

In [None]:
example_person = {
    'Age': 28,
    'Sex': 'Male',
    'BMI': 19.2,
    'SystolicBP': 112,
    'DiastolicBP': 72,
    'FastingGlucose_mgdl': 165.0,
    'HbA1c': 7.8,
    'TotalCholesterol': 170,
    'Triglycerides': 140,
    'ActivityLevel': 'Moderate',
    'Smoker': 'No',
    'RetinaFinding': 'None',
    'FamilyHistory_T2D': 'None',
    'AutoimmuneHistory': 'Yes',
    'GAD65_Antibody': 'Positive',
    'C_Peptide_ngml': 0.6,
    'OnsetAge': 24,
    'DKA_History': 'Yes',
    'PastDiseases': 'Thyroid',
    'PulsePressure': 112-72,
    'BMI_Class': 'Normal'
}

newX = pd.DataFrame([example_person])

pred_cont = linreg.predict(newX)[0]
pred_type_lin = 1 if pred_cont < 1.5 else 2

pred_type_log = int(logreg.predict(newX)[0])

print('Linear Regression predicted continuous:', round(pred_cont, 3))
print('Linear Regression -> Type:', pred_type_lin)
print('Logistic Regression -> Type:', pred_type_log)

## Results, Limitations, and Recommendations

**Results**
- Built two models: a **Linear Regression** with thresholding and a **Logistic Regression** classifier.
- Reported accuracy, confusion matrix, and classification report on a hold-out test set.

**Limitations**
- Data is **synthetic** and small (n=100), so generalization is limited.
- Labels are generated from heuristic rules + noise; not medically validated.
- Linear Regression is not ideal for classification; used here per requirement and for comparison.
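- The split is stratified and seeded (`random_state=42`), so it is reproducible, but with only 25 test patients the metrics are sensitive to which rows land in the test set.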
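
Because a single hold-out split on so few rows gives noisy metrics, stratified k-fold cross-validation is a common way to see how much the score moves. The sketch below uses stand-in data from `make_classification` (an assumption, since the CSV is not available here); the names `X_demo`, `y_demo`, and `model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with the same size as the dataset (100 rows).
X_demo, y_demo = make_classification(
    n_samples=100, n_features=8, n_informative=4, random_state=0
)

model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Five stratified folds: each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')

print('Fold accuracies:', scores.round(3))
print('Mean:', round(scores.mean(), 3), 'Std:', round(scores.std(), 3))
```

With the notebook's objects, the equivalent call would be `cross_val_score(logreg, X, y, cv=cv, scoring='accuracy')`; the spread across folds gives a rough sense of how far a single test-set accuracy can be trusted.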

**Recommendations**
- Use **larger, real datasets** with clinically validated labels.
- Prefer **classification models** (Logistic Regression, Random Forest, XGBoost).
- Add **calibration** and **probability thresholds** tuned via ROC/PR analysis.
- Incorporate more features (OGTT results, an islet antibody panel, fasting and stimulated C-peptide) and time to insulin requirement.
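
As an illustration of the threshold-tuning recommendation, the sketch below picks a probability cut-off from the ROC curve using Youden's J statistic (TPR minus FPR). It runs on stand-in data from `make_classification`, and the choice of Youden's J is an assumption; other criteria, such as a target recall on the PR curve, may fit better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Stand-in binary data (labels 0/1), not the notebook's dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_pos = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

# Tune the threshold on a validation split, never on the final test set.
fpr, tpr, thresholds = roc_curve(y_val, proba_pos)
best_idx = int(np.argmax(tpr - fpr))  # Youden's J = TPR - FPR
best_threshold = thresholds[best_idx]

y_pred_tuned = (proba_pos >= best_threshold).astype(int)
print('AUC:', round(roc_auc_score(y_val, proba_pos), 3))
print('Chosen threshold:', round(float(best_threshold), 3))
```

With labels coded 1 and 2, as in this notebook, the column order of `predict_proba` follows `classes_`, so column 1 would correspond to Type 2 and `roc_curve` would need `pos_label=2`.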
