
# UCI Pima Indians Diabetes — Weeks 1–2 Template

**Goal:** Practice an end-to-end data science workflow: load → explore → clean → baseline models → reflect  
**Dataset:** [Pima Indians Diabetes Database (hosted on Kaggle by UCI Machine Learning)](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)  
**Target:** `Outcome` (1 = diabetes, 0 = no diabetes)


## 0. Setup

In [3]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
)


## 1. Load the Data

In [4]:

# TODO: Point this at wherever you saved diabetes.csv (download it from the dataset link above)
df = pd.read_csv("diabetes.csv")
df.head()


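`pd.read_csv` raises `FileNotFoundError` when the path does not resolve from the notebook's working directory. The sketch below is one way to fail early with a clearer message; `load_diabetes_csv` is a hypothetical helper, not part of the template, and the default file name is only the one used in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_diabetes_csv(path="diabetes.csv"):
    """Load the CSV, reporting the absolute path that was tried if it is missing.

    Hypothetical helper for illustration; adjust `path` to your own layout.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Could not find {csv_path.resolve()}. "
            "Download the CSV and place it there, or pass the correct path."
        )
    return pd.read_csv(csv_path)


# Usage (uncomment once the file is in place):
# df = load_diabetes_csv("diabetes.csv")
# df.head()
```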

## 2. Inspect the Data

In [None]:

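# A notebook cell only displays its last expression, so print the intermediate results
print(df.shape)
df.info()
print(df.describe())
print(df['Outcome'].value_counts(normalize=True))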


## 3. Data Quality — Missing & Impossible Values

In [None]:

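# A zero in these columns is physiologically implausible and stands in for a missing value
cols_with_zero_invalid = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count placeholder zeros per column (printed so the result shows mid-cell)
print((df[cols_with_zero_invalid] == 0).sum())

# Recode placeholder zeros as NaN so they can be imputed later
df_clean = df.copy()
df_clean[cols_with_zero_invalid] = df_clean[cols_with_zero_invalid].replace(0, np.nan)
df_clean.isna().sum()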
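Raw missing counts are easier to judge as a share of rows. The sketch below repeats the zero-to-NaN recoding on a small made-up frame (the values and the `toy` name are illustrative assumptions, not rows from the dataset) and reports the missing fraction per column. Note that `Pregnancies` is left alone because zero is a legitimate value there.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for diabetes.csv (values are made up)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 0, 94],
    "Pregnancies": [6, 0, 8, 1],  # zero is a valid value in this column
    "Outcome": [1, 0, 1, 0],
})
invalid_zero_cols = ["Glucose", "Insulin"]

# Recode placeholder zeros as NaN only in the columns where zero is impossible
toy_clean = toy.copy()
toy_clean[invalid_zero_cols] = toy_clean[invalid_zero_cols].replace(0, np.nan)

# Fraction of rows missing per column, largest first
missing_share = toy_clean.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same `isna().mean()` call on `df_clean` shows which columns depend most heavily on imputation.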


## 4. Visualize

In [None]:

df_clean.hist(figsize=(12,10), bins=20)
plt.show()

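# Plot every feature against the target; drop by name rather than assuming Outcome is the last column
for col in df_clean.columns.drop("Outcome"):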
    df_clean.boxplot(column=col, by="Outcome", figsize=(6,4))
    plt.title(f"{col} by Outcome")
    plt.suptitle("")
    plt.show()


## 5. Train/Test Split

In [None]:

X = df_clean.drop("Outcome", axis=1)
y = df_clean["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
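Passing `stratify=y` asks `train_test_split` to keep the class proportions similar in both splits. A minimal check on made-up labels (the 35/65 class mix and the `*_toy` names are assumptions for illustration) could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 35 positives and 65 negatives
y_toy = pd.Series([1] * 35 + [0] * 65, name="Outcome")
X_toy = pd.DataFrame({"feature": np.arange(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The positive rate should be close to the overall rate in both splits
print(f"overall: {y_toy.mean():.3f}")
print(f"train:   {y_tr.mean():.3f}")
print(f"test:    {y_te.mean():.3f}")
```

The same three lines applied to `y`, `y_train`, and `y_test` confirm the split on the real data.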


## 6. Preprocess (Impute + Scale)

In [None]:

preprocess = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
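The pipeline learns its imputation median and scaling statistics from whatever data it is fitted on, then reuses them on new data. The toy sketch below (single made-up column; the `*_toy` arrays are assumptions) shows that a missing test value is filled with the training median, not a value computed from the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data with one missing value in each split
X_train_toy = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test_toy = np.array([[np.nan], [100.0]])

prep = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
prep.fit(X_train_toy)  # statistics come from the training rows only

# Median of the observed training values 1, 3, 5
learned_median = prep.named_steps["imputer"].statistics_[0]
print(learned_median)

# The missing test value is imputed with the training median, which is also
# the training mean after imputation, so it scales to 0
test_transformed = prep.transform(X_test_toy)
print(test_transformed)
```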


## 7. Baseline Models

In [None]:

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier()
}

results = []

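for name, model in models.items():
    # Imputation and scaling are refit on X_train inside each pipeline, so no test-set statistics leak in
    pipe = Pipeline([("prep", preprocess), ("clf", model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)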
    
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    
    results.append((name, acc, prec, rec, f1))

pd.DataFrame(results, columns=["Model","Accuracy","Precision","Recall","F1"])
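A single hold-out split gives one estimate per model, which can shift noticeably with a different `random_state`. As an optional extension, stratified k-fold cross-validation gives a mean and spread instead. This is a sketch on synthetic data from `make_classification` (an assumption so the block runs without `diabetes.csv`); on the real data, `X_train` and `y_train` would take the place of `X_syn` and `y_syn`.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 features, roughly 35% positives
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=42
)

pipe_cv = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Preprocessing is refit inside every fold because it lives in the pipeline
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_f1 = cross_val_score(pipe_cv, X_syn, y_syn, cv=cv, scoring="f1")

print(f"F1 per fold: {cv_f1.round(3)}")
print(f"mean = {cv_f1.mean():.3f}, std = {cv_f1.std():.3f}")
```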


## 8. Diagnostics (Confusion Matrix)

In [None]:

# Logistic regression is used here as an example; swap in whichever model scored best in Section 7
best_model = LogisticRegression(max_iter=1000)
pipe = Pipeline([("prep", preprocess), ("clf", best_model)])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()

print(classification_report(y_test, y_pred, digits=3))
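To connect the confusion matrix to the numbers in the classification report, the sketch below unpacks the four cells with `confusion_matrix` and recomputes precision and recall by hand. The label vectors are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions (1 = diabetes, 0 = no diabetes)
y_true_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Precision: of the predicted positives, how many were truly positive
precision_manual = tp / (tp + fp)
# Recall: of the true positives, how many were found
recall_manual = tp / (tp + fn)
print(f"precision={precision_manual:.2f}, recall={recall_manual:.2f}")
```

In a screening setting, false negatives (missed diabetes cases) are often the costlier error, so recall on class 1 deserves attention alongside accuracy.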


## 9. Reflection


- Which features look most predictive?  
- Where did you spend the most time (cleaning vs. modeling)?  
- What could be improved (feature engineering, thresholds, new models)?  
- What biases might exist in this dataset?  
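
As a starting point for the question about thresholds, the sketch below sweeps the decision threshold on predicted probabilities in place of the default 0.5 used by `predict`. It runs on synthetic data from `make_classification` (an assumption so the block is self-contained), and the threshold values are arbitrary examples. On the real data, the fitted `pipe` and `X_test` would be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
clf.fit(X_tr, y_tr)

# Probability of the positive class for each test row
proba_pos = clf.predict_proba(X_te)[:, 1]

thresholds = [0.3, 0.5, 0.7]  # arbitrary example values
recalls = []
for t in thresholds:
    y_hat = (proba_pos >= t).astype(int)
    prec = precision_score(y_te, y_hat, zero_division=0)
    rec = recall_score(y_te, y_hat, zero_division=0)
    recalls.append(rec)
    print(f"threshold={t:.1f}  precision={prec:.3f}  recall={rec:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases as the threshold drops; precision usually falls in exchange.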
