## Machine Learning - Assignment 2

***************************************************************************
BITS ID: **2025AA05932**

Name: **Sagar Konde**

Email: 2025aa05923@wilp.bits-pilani.ac.in

***************************************************************************

This notebook compares the following six models for a classification task:
1. Logistic Regression
2. Decision Tree Classifier
3. K-Nearest Neighbor Classifier
4. Naive Bayes Classifier
5. Random Forest (Ensemble)
6. XGBoost (Ensemble)

**Dataset**:\
"Early Stage Diabetes Risk Prediction" dataset from UCI
https://archive.ics.uci.edu/dataset/529/early+stage+diabetes+risk+prediction+dataset

In [1]:
%pip install ucimlrepo



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

# Preprocessing and Metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, roc_auc_score, precision_score,
    recall_score, f1_score, matthews_corrcoef,
    confusion_matrix, classification_report
)

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

import warnings
# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

### 1. Dataset Loading and Preprocessing


In [3]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
early_stage_diabetes_risk_prediction = fetch_ucirepo(id=529)

# data (as pandas dataframes)
X = early_stage_diabetes_risk_prediction.data.features
y = early_stage_diabetes_risk_prediction.data.targets


In [4]:
# Combine features and target in one frame (copy so the fetched data stays untouched)
df = X.copy()
df['target'] = y.squeeze()

print(f"Dataset Shape: {df.shape}")

# Encoding target variable
le = LabelEncoder()
df['target'] = le.fit_transform(df['target'])

# One-hot encode categorical features
X_encoded = pd.get_dummies(df.drop('target', axis=1), drop_first=True)
y_encoded = df['target']

# Splitting Features and Target
X = X_encoded
y = y_encoded

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save scaler
joblib.dump(scaler, 'scaler.pkl')

Dataset Shape: (520, 17)


['scaler.pkl']
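
The scaling step above fits `StandardScaler` on the training split only and reuses the learned statistics for the test split, so no information from the test data leaks into preprocessing. A minimal pure-Python sketch of that idea follows. The helper names `fit_standardizer` and `apply_standardizer` are hypothetical, and the age values are made up for illustration rather than taken from the dataset.

```python
from statistics import mean, pstdev

def fit_standardizer(train_column):
    """Learn the mean and population std from the training values only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)  # population std (ddof=0), as StandardScaler uses
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma

def apply_standardizer(column, mu, sigma):
    """Scale any split with statistics learned from the training split."""
    return [(value - mu) / sigma for value in column]

# Illustrative values (hypothetical, not from the dataset)
train_ages = [30, 40, 50]
test_ages = [60]

mu, sigma = fit_standardizer(train_ages)
train_scaled = apply_standardizer(train_ages, mu, sigma)
test_scaled = apply_standardizer(test_ages, mu, sigma)

print(train_scaled)
print(test_scaled)
```

The test value is scaled with the training mean and standard deviation, which mirrors calling `scaler.transform(X_test)` rather than refitting on the test split.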

### 2. Model Training and Evaluation Helper
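We define a helper function that computes all six required metrics (Accuracy, AUC, Precision, Recall, F1, and MCC) for a fitted model, appends them to `results`, and saves the model to disk.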

In [5]:
results = []

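def evaluate_model(name, model, X_tst, y_tst):
    """Score a fitted model on the test split, record its metrics, and save it."""
    y_pred = model.predict(X_tst)
    # Use positive-class probabilities for AUC when available;
    # otherwise fall back to hard labels, which gives a coarser AUC
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_tst)[:, 1]
    else:
        y_prob = y_pred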

    metrics = {
        "ML Model Name": name,
        "Accuracy": accuracy_score(y_tst, y_pred),
        "AUC": roc_auc_score(y_tst, y_prob),
        "Precision": precision_score(y_tst, y_pred),
        "Recall": recall_score(y_tst, y_pred),
        "F1": f1_score(y_tst, y_pred),
        "MCC": matthews_corrcoef(y_tst, y_pred)
    }
    results.append(metrics)

    # Save Model File
    joblib.dump(model, f"{name.lower().replace(' ', '_')}.pkl")

    print(f"Finished evaluating {name}")
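
To make the metrics in the helper concrete, the sketch below recomputes them by hand in pure Python instead of calling scikit-learn. The function names `confusion_metrics` and `pairwise_auc` are hypothetical, and the confusion-matrix counts and scores are illustrative assumptions, not values read from the notebook's models.

```python
from math import sqrt

def confusion_metrics(tp, fp, fn, tn):
    """Compute threshold-based metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "Accuracy": (tp + tn) / total,
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
        "MCC": mcc,
    }

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hand-picked counts for illustration: 104 samples with one false negative
example = confusion_metrics(tp=70, fp=0, fn=1, tn=33)
# Small hypothetical score vector: three of four pairs are ranked correctly
auc_example = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

for metric, value in example.items():
    print(f"{metric}: {value:.6f}")
print(f"AUC: {auc_example:.2f}")
```

With these counts the output agrees with the Random Forest row of the comparison table in Section 4 to six decimals, which suggests (but does not prove) that the row corresponds to a single false negative on the 104-sample test split. The pairwise view of AUC also shows why the hard-label fallback in `evaluate_model` is coarser: labels produce many tied pairs, each of which only counts as 0.5.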

### 3. Implementation of All 6 Models

In [6]:
# 1. Logistic Regression
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)
evaluate_model("Logistic Regression", lr, X_test_scaled, y_test)

# 2. Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
evaluate_model("Decision Tree", dt, X_test, y_test)

# 3. KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
evaluate_model("KNN", knn, X_test_scaled, y_test)

# 4. Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
evaluate_model("Naive Bayes", nb, X_test, y_test)

# 5. Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
evaluate_model("Random Forest", rf, X_test, y_test)

# 6. XGBoost
xgb = XGBClassifier(eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
evaluate_model("XGBoost", xgb, X_test, y_test)

Finished evaluating Logistic Regression
Finished evaluating Decision Tree
Finished evaluating KNN
Finished evaluating Naive Bayes
Finished evaluating Random Forest
Finished evaluating XGBoost


### 4. Comparison Table (For README.md)

In [8]:
comparison_df = pd.DataFrame(results)
display(comparison_df)

comparison_df.to_csv('model_comparison.csv', index=False)

|   | ML Model Name | Accuracy | AUC | Precision | Recall | F1 | MCC |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.923077 | 0.977379 | 0.931507 | 0.957746 | 0.944444 | 0.820358 |
| 1 | Decision Tree | 0.951923 | 0.964789 | 1.0 | 0.929577 | 0.963504 | 0.898479 |
| 2 | KNN | 0.894231 | 0.977379 | 0.954545 | 0.887324 | 0.919708 | 0.769771 |
| 3 | Naive Bayes | 0.913462 | 0.960734 | 0.930556 | 0.943662 | 0.937063 | 0.798823 |
| 4 | Random Forest | 0.990385 | 1.0 | 1.0 | 0.985915 | 0.992908 | 0.978222 |
| 5 | XGBoost | 0.971154 | 1.0 | 1.0 | 0.957746 | 0.978417 | 0.936981 |
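
Since the comparison is intended for README.md, one possible way to turn the saved CSV into a Markdown table is sketched below using only the standard library. The function name `csv_to_markdown` and its `float_digits` parameter are hypothetical, and the sample text holds just one row copied from the table above. In practice the contents of `model_comparison.csv` would be passed in instead.

```python
import csv
import io

def csv_to_markdown(csv_text, float_digits=4):
    """Render CSV text as a Markdown table, rounding numeric cells."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    def fmt(cell):
        # Round cells that parse as floats; leave text such as model names as-is
        try:
            return f"{float(cell):.{float_digits}f}"
        except ValueError:
            return cell

    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        lines.append("| " + " | ".join(fmt(c) for c in row) + " |")
    return "\n".join(lines)

# One row taken from the comparison table, for illustration
sample_csv = "ML Model Name,Accuracy,MCC\nRandom Forest,0.990385,0.978222\n"
markdown_table = csv_to_markdown(sample_csv)
print(markdown_table)
```

Rounding to four decimals is a presentation choice for the README only. The CSV written by the notebook keeps the full precision.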
